- Methodology article
- Open Access
Simulation of microarray data with realistic characteristics
- Matti Nykter^{1},
- Tommi Aho^{1},
- Miika Ahdesmäki^{1},
- Pekka Ruusuvuori^{1},
- Antti Lehmussola^{1} and
- Olli Yli-Harja^{1}
https://doi.org/10.1186/1471-2105-7-349
© Nykter et al; licensee BioMed Central Ltd. 2006
- Received: 15 November 2005
- Accepted: 18 July 2006
- Published: 18 July 2006
Abstract
Background
Microarray technologies have become common tools in biological research. As a result, a need for effective computational methods for data analysis has emerged. Numerous different algorithms have been proposed for analyzing the data. However, an objective evaluation of the proposed algorithms is not possible due to the lack of biological ground truth information. To overcome this fundamental problem, the use of simulated microarray data for algorithm validation has been proposed.
Results
We present a microarray simulation model which can be used to validate different kinds of data analysis algorithms. The proposed model is unique in the sense that it includes all the steps that affect the quality of real microarray data. These steps include the simulation of biological ground truth data, applying biological and measurement technology specific error models, and finally simulating the microarray slide manufacturing and hybridization. After all these steps are taken into account, the simulated data has realistic biological and statistical characteristics. The applicability of the proposed model is demonstrated by several examples.
Conclusion
The proposed microarray simulation model is modular and can be used in different kinds of applications. It includes several error models that have been proposed earlier and it can be used with different types of input data. The model can be used to simulate both spotted two-channel and oligonucleotide based single-channel microarrays. All this makes the model a valuable tool, for example, in the validation of data analysis algorithms.
Keywords
- Error Model
- Error Source
- Ground Truth Data
- Slide Image
- Spot Shape
Background
The emergence of several high throughput measurement technologies provides new possibilities to study biological organisms at the system level. New technologies produce such large amounts of data that they can no longer be analyzed by hand. This has made computational techniques an inseparable part of data analysis. Although new computational methods are continuously proposed for data analysis, their performance cannot be objectively evaluated. This remains a fundamental problem in method development. Typically, validation of data analysis methods is based on clinically determined labels of biological samples. If the computational method produces results which are consistent with the predetermined labels, then the method is considered to work reliably. This approach, however, relies entirely on a priori information about the data. Furthermore, the clinical classification of samples is not always unambiguous [1, 2].
A more objective approach to validate the data analysis methods is to use data whose characteristics and ground truth are known [3, 4]. Unfortunately, in real life problems this kind of data usually does not exist. Thus, to obtain data with known ground truth, one needs to produce the data by simulation. If simulated data is used to evaluate the performance of the analysis methods, can it be guaranteed that the same performance is obtained with real data also? To get meaningful results, the simulated data and the real biological data have to have similar biological and statistical characteristics.
A problem in the validation of data analysis algorithms using simulated data is that there is always an underlying mathematical model that is used to simulate the data. Thus, when different computational methods are compared, this approach favors the ones that implement the same assumptions as the data generation process does. While this is a fundamental problem, it can be circumvented by evaluating the methods using simulated data produced by different kinds of models. When the results are combined, the bias due to the model assumptions can be avoided.
Numerous studies have focused on mathematical modeling of biological and measurement errors, including both stochastic noise and systematic bias [5–11]. These studies have improved the analysis methods by utilizing the knowledge about the data properties [7]. This knowledge can be utilized in the generation of simulated data as well.
The error model itself is not enough for the simulation of biologically and statistically accurate data. Before an error model can be applied, the ground truth biological signal needs to be obtained. Depending on the application, a biological signal can be obtained for example by sampling a proper distribution or by modeling and simulating the biological system using differential equation models [12].
Once the biological ground truth signal has been generated and the error model has been applied, simulated data is still not comparable to real measurement data. Real data is always extracted from a measurement system. In the case of gene expression microarrays, image processing algorithms are used to read the spot values from the scanned slide image. The applied grid alignment, segmentation and data extraction algorithms have a significant effect on the obtained data [13].
There are numerous possible applications for a simulation model that can simulate realistic biological measurement data. The most obvious application is the validation and improvement of data analysis algorithms [3, 4, 14]. In addition, different data extraction algorithms can effectively be tested under different noise conditions. If the biological ground truth model is accurate enough, one might even be able to simulate entire microarray experiments. If this could be done before performing expensive laboratory experiments, the proposed hypotheses could be tested with simulated data. This could help in finding problems in the design of the experiment and, thus, potentially save a significant amount of time and money.
While all the steps of the simulation process have been extensively studied separately [6, 7, 15–18], not much work has been done to combine all the steps. We propose a model that combines these steps and can be used to produce microarray data with realistic biological and statistical characteristics. The proposed model is modular and it can be easily extended to include new error models and even new measurement technologies. The current implementation supports the simulation of spotted two-channel microarrays and oligonucleotide based single-channel microarrays. We have implemented the model in the Matlab environment [see additional file 1]. The simulation model is also available for download at our companion web page [19].
Biologically meaningful input data can be obtained from various sources. We introduce some ways in which this data can be obtained. We then review several previously published error models which model biological and measurement technology specific errors, and which can be used to add realistic statistical properties to the simulated data. The resulting data is used as a basis for simulating the production of the microarray slides. After that, we discuss the final step in obtaining realistic measurement data: the extraction of the gene expressions from the slide. Finally, we demonstrate the applications of the proposed model by examples.
Generation of the ground truth data
Depending on the application, the requirements for the ground truth data may vary. A typical microarray experiment includes comparison of different classes of samples, measuring a response to a perturbation, or measuring time series behavior. Validation of the data analysis methods developed for each of these applications sets different requirements for the ground truth data.
The simplest approach to generate the ground truth data is to sample data randomly from a specific distribution. First the distribution and its parameters can be estimated from real measurements. Next the ground truth data can be obtained by sampling a simulated ideal distribution with estimated parameters [7, 15]. This approach can be adequate for several applications. The detection of differentially expressed genes is often based on the comparison of statistical properties of microarray data from two different samples, for example from two different cancer types. Therefore, the ground truth data suitable for validating data analysis methods can be obtained simply by sampling two distributions with different parameters.
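As a minimal sketch of the sampling approach described above, the following Python snippet draws log-scale ground truth for two sample classes. The function name, the choice of a Gaussian distribution, and the parameters `mu`, `sigma` and `shift` are illustrative assumptions, not part of the published Matlab implementation; in practice the distribution and its parameters would be estimated from real measurements.

```python
import random

def sample_two_class_ground_truth(n_genes, n_diff, n_samples=5,
                                  mu=8.0, sigma=1.0, shift=2.0, seed=0):
    """Sample ground truth expression for two classes (hypothetical helper).

    Returns (class_a, class_b, diff_genes), where class_a[g] and class_b[g]
    are lists of n_samples values for gene g, and diff_genes is the set of
    genes whose class B distribution has a shifted mean.
    """
    rng = random.Random(seed)
    # Pick which genes are truly differentially expressed.
    diff_genes = set(rng.sample(range(n_genes), n_diff))
    class_a, class_b = [], []
    for g in range(n_genes):
        # Same distribution for both classes unless the gene is differential.
        mu_b = mu + shift if g in diff_genes else mu
        class_a.append([rng.gauss(mu, sigma) for _ in range(n_samples)])
        class_b.append([rng.gauss(mu_b, sigma) for _ in range(n_samples)])
    return class_a, class_b, diff_genes
```

Because `diff_genes` is returned together with the data, the output of a differential expression method can be scored directly against the known ground truth.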
If the purpose of the data analysis is to study the behavior of the system in more detail, for example to study responses to perturbations, then biologically more detailed data can be generated. Because microarray technology measures gene expressions, the natural source for biological data would be a model of a genetic regulatory network (GRN). Unfortunately, GRNs are generally not known well enough to be utilized in data simulation [20].
However, in some cases parts of the networks are known and even simulation models that include parts of the genetic regulatory mechanisms have been proposed [12, 17]. These kinds of models would be ideal for the generation of ground truth data. If a model is accurate enough, even hypotheses about the behavior of the real system could be tested before a real microarray experiment is done.
Generation of data with biologically meaningful characteristics does not require the modeling of real GRNs [18]. Instead one can use networks with random topology. If the interactions between network components are modeled properly, for example by utilizing interaction information from real GRNs, one could produce data with realistic characteristics [20].
Once the network model has been obtained and mathematical models for interactions have been formulated, the expression values of the genes in the network need to be simulated. There are several publicly available software packages that can be used to accomplish this task [21, 22].
Yet another application for microarray data is network inference, that is, learning the network structure and the interaction rules between the network components from time series or perturbation measurements.
In network inference, the modeling of control mechanisms of a network plays an essential role. Therefore, it is not necessary that all the interactions correspond to the ones of a real network and thus, even coarse scale models can be used. For example, it is shown that very simple models, even random Boolean networks, can capture some of the essential characteristics of real GRNs [23, 24]. Thus it may be sufficient to use for example a Boolean network as a ground truth in network inference studies [25].
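To illustrate how a random Boolean network could serve as ground truth, the sketch below builds a network with random topology and random update rules and simulates a time series from it. All function names, the fixed in-degree `k`, and the synchronous update scheme are assumptions made for this example; they are not taken from the cited studies.

```python
import random

def make_random_boolean_network(n_genes, k=2, seed=0):
    """Return (inputs, tables) for a random Boolean network (hypothetical helper)."""
    rng = random.Random(seed)
    # Each gene gets k randomly chosen regulators ...
    inputs = [rng.sample(range(n_genes), k) for _ in range(n_genes)]
    # ... and a random truth table with one output bit per input combination.
    tables = [[rng.randint(0, 1) for _ in range(2 ** k)] for _ in range(n_genes)]
    return inputs, tables

def step(state, inputs, tables):
    """Synchronously update every gene from the current state."""
    new_state = []
    for regulators, table in zip(inputs, tables):
        # Encode the regulators' states as an index into the truth table.
        idx = 0
        for r in regulators:
            idx = (idx << 1) | state[r]
        new_state.append(table[idx])
    return new_state

def simulate(state, inputs, tables, n_steps):
    """Return the trajectory [state_0, ..., state_n_steps]."""
    trajectory = [list(state)]
    for _ in range(n_steps):
        state = step(state, inputs, tables)
        trajectory.append(state)
    return trajectory
```

The known `inputs` and `tables` play the role of the ground truth against which an inferred network structure can be compared.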
Real measurement data can also be used as ground truth data. This is the case, for example, if we want to study how our data analysis algorithm performs under different types and amounts of noise. By adding noise to real measurement data we can effectively test if the performance of our data analysis algorithms degrades as the amount of noise increases. This can give us valuable insight into the robustness of the algorithms.
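A robustness test of this kind can be sketched as follows. The snippet adds zero-mean Gaussian noise to a signal so that a target signal-to-noise ratio is reached; the function name and the decibel definition of the SNR (ratio of mean signal power to noise variance) are assumptions for illustration and may differ from the definition used in the simulator.

```python
import math
import random

def add_noise_at_snr(signal, snr_db, seed=0):
    """Add Gaussian noise so that 10*log10(P_signal / P_noise) is about snr_db."""
    rng = random.Random(seed)
    # Mean power of the clean signal.
    signal_power = sum(v * v for v in signal) / len(signal)
    # Noise variance that yields the requested SNR (assumed dB definition).
    noise_var = signal_power / (10 ** (snr_db / 10))
    noise_sd = math.sqrt(noise_var)
    return [v + rng.gauss(0.0, noise_sd) for v in signal]
```

Running an analysis algorithm on the same data at a decreasing sequence of `snr_db` values then shows how its performance degrades as the noise level grows.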
Microarray simulation model
In this section, a model for microarray measurements is presented. The model can use input data from numerous different sources. In practice, there is no limitation on what kind of simulator or software is used to generate the ground truth data.
List of noise parameters. Noise parameters available in the microarray simulation model.
Parameter | Description |
---|---|
kernel | Kernel used to model the population effect. |
copies | Number of times the population effect is applied. |
errormodel | Error model to be used; each error model has its own parameters, see Table 5. |
List of slide parameters. Overview of the slide simulation parameters. More detailed documentation of the parameters is available on the companion web page [19].
Parameter | Description |
---|---|
S _{ type } | Type of the slide (single or two channel). |
S _{ spot } | Model used for the spot: circle, Gaussian, hyperbolic. |
S _{ pix } | Maximum width/height of the area for the spot in pixels. |
S _{ movprob } | Probability for a spot to drift (move) from designated location. This parameter models random movement. See parameter B_{ curve }for systematic drift. |
S _{ mov } | Maximum allowed movement bias from designated location, movement in x-axis S_{ x }and y-axis S_{ y }are drawn from uniform distribution U (-S_{ mov }, S_{ mov }). |
S _{ μ } | Mean radius of the simulated spot. Spot radius is drawn from N (S_{ μ }, ${S}_{{\sigma}^{2}}$) distribution. |
${S}_{{\sigma}^{2}}$ | Allowed variation (variance) of the spot size. |
P | If set, print tip leaves a mark to the spot. |
P _{ p } | Probability for print tip mark to be visible in a spot. |
P _{ h } | Maximum height of the print tip mark, print tip height is drawn from U (0, P_{ h }) distribution. |
P _{ w } | Maximum width of the print tip mark, print tip width is drawn from U (0, P_{ w }) distribution. |
P _{ b } | Maximum of how much print tip mark is allowed to drift from spot center. Movement in x-axis P_{ x }and y-axis P_{ y }are drawn from U (-P_{ b }, P_{ b }). |
C _{ prob } | Probability for a spot to suffer from a chord cut. |
C _{ num } | Maximum number of chord cuts from a spot. |
C _{ cut } | Maximum depth of the chord cut, cut depth is drawn from U (0, C_{ cut }). |
N _{ slides } | Number of slides to be generated. |
N _{ time } | Time points when slides are made. This is relevant only for time series data. |
N _{ channels } | Number of channels (different dyes) on the slide. |
N _{ spots } | Total number of spots on the slide. |
N _{ height } | Number of rows of spots on the slide. |
N _{ width } | Number of columns of spots on the slide. |
B | Subarray layout on the slide i.e. number of (subarray)rows and (subarray)columns. |
B _{ space } | Space between individual subarrays on the slide. |
B _{ curve } | Parameter used to control the subarray curving (i.e. systematic drift in spot printing). |
B _{ maxc } | Maximum distance the bin is allowed to curve, curvature parameter is drawn from U (0, B_{ maxc }). |
B _{ spots } | Number of spots in each subarray. |
B _{ height } | Number of rows in subarrays. |
B _{ width } | Number of columns in subarrays. |
List of hybridization parameters. Overview of the hybridization effect parameters.
Parameter | Description |
---|---|
${H}_{{\sigma}^{2}}$ | Multiplicative Gaussian hybridization noise variance. Hybridization noise is drawn from N (0, ${H}_{{\sigma}^{2}}$). |
H _{ errors } | If set, hybridization errors are included in simulation. |
H _{ bgnoise } | Percent of the intensity values covered by the background noise. |
H _{ bgvar } | Background noise variance, relative to background noise mean determined using H_{ bgnoise }. |
H _{ bggrad } | Gradient (noise pattern) for background noise. |
H _{ noscratch } | Number of scratches on the slide. |
H _{ slength } | Maximum length of the scratch, scratch length is drawn from U (0, H_{ slength }). |
H _{ Swidth } | Width of the scratch. |
H _{ noair } | Number of air bubbles visible on the slide. |
${H}_{{\mu}_{air}}$ | Mean for the air bubble radius, drawn from N (μ_{ air }, ${\sigma}_{air}^{2}$). |
${H}_{{\sigma}_{air}^{2}}$ | Allowed variation (variance) for air bubble size radius. |
H _{ bleed } | Percent of spots having dye outside spot area (bleeding). |
H _{ bleedsize } | Size of the spot bleed (how many times the spot size). |
H _{ bleeddist } | How far from the origin the bleeding goes. |
List of scanner parameters. Overview of the scanner effect parameters.
Parameter | Description |
---|---|
R _{ power } | Scanner power is used for histogram equalization, more power yields brighter image. |
R _{ b } | The dynamic range of the scanner. Intensity values are quantized to ${2}^{{R}_{b}}$ levels. |
R _{ eq } | If set, histogram equalization is applied. |
R _{ th } | Threshold parameter for quantization, values over the threshold are saturated. |
R _{ Rch } | Number of channel that is considered as red dye. |
R _{ Gch } | Number of channel that is considered as green dye. |
R _{ errors } | If set, scanner errors are applied. |
R _{ angle } | Angle at which the slide is scanned. |
R _{ mm } | Misalignment between red and green channel. |
List of error models. Error models (EM) and the parameters for each of the implemented error model. Noise free input data is denoted by x and the noisy output data by y. Index i refers to gene, j to array (chip), and k to biological sample specific noise. Index p refers to a specific probe within a probe set.
Parameter | Description |
---|---|
Simple EM: | |
Model | Additive Gaussian noise is added to the data. |
μ | Mean of the additive Gaussian noise. Noise is drawn from N (μ, α^{2}) |
α ^{2} | Variance of the additive Gaussian noise. |
SNR EM: | |
Model | Additive Gaussian noise is added to the data with given signal-to-noise ratio. |
μ | Mean of the additive Gaussian noise. |
SNR | Signal-to-noise ratio after the noise is added. |
Dror EM [7]: | |
Model | y = g * (x_{ i }* x) + f + ε |
${\mu}_{{x}_{i}}$, ${\sigma}_{{x}_{i}}^{2}$ | Binding efficiency of each probe x_{ i }is drawn from Gaussian distribution N (${\mu}_{{x}_{i}}$, ${\sigma}_{{x}_{i}}^{2}$). |
μ_{ f }, ${\sigma}_{f}^{2}$ | Gene specific bias f is drawn from Gaussian distribution N (μ_{ f }, ${\sigma}_{f}^{2}$). |
α_{ ε }, β_{ ε } | Gene and chip specific error ε is drawn from Laplace distribution L (α_{ ε }, β_{ ε }). |
μ_{ g }, ${\sigma}_{g}^{2}$ | Multiplicative gene and chip specific noise g is drawn from log-normal distribution LN (μ_{ g }, ${\sigma}_{g}^{2}$). |
Hartemink EM [9]: | |
Model | In log scale y = x + ρ_{ j }+ ε_{ ij } |
${\mu}_{{\rho}_{j}}$, ${\sigma}_{{\rho}_{j}}^{2}$ | Chip specific bias ρ_{ j }is drawn from Gaussian distribution N (${\mu}_{{\rho}_{j}}$, ${\sigma}_{{\rho}_{j}}^{2}$). |
${\sigma}_{{\epsilon}_{ij}}^{2}$ | Gene and chip specific error ε_{ ij }is drawn from Gaussian distribution N (0, ${\sigma}_{{\epsilon}_{ij}}^{2}$). |
Hierarchical EM [6]: | |
Model | In log scale y = X + ε, X = x + g_{ i }+ c_{ j }+ r_{ ij }+ b_{ ijk } |
${\sigma}_{\epsilon}^{2}$ | Independent random noise ε is drawn from zero mean Gaussian distribution N (0, ${\sigma}_{\epsilon}^{2}$). |
${\sigma}_{{g}_{i}}^{2}$ | Gene specific noise g_{ i }is drawn from zero mean Gaussian distribution N (0, ${\sigma}_{{g}_{i}}^{2}$). |
${\sigma}_{{c}_{j}}^{2}$ | Chip specific noise c_{ j }is drawn from zero mean Gaussian distribution N (0, ${\sigma}_{{c}_{j}}^{2}$). |
${\sigma}_{{r}_{ij}}^{2}$ | Gene and chip specific noise r_{ ij }is drawn from zero mean Gaussian distribution N (0, ${\sigma}_{{r}_{ij}}^{2}$). |
${\sigma}_{{b}_{ijk}}^{2}$ | Gene, chip and biological sample specific noise b_{ ijk }is drawn from zero mean Gaussian distribution N (0, ${\sigma}_{{b}_{ijk}}^{2}$). |
Rocke EM [8]: | |
Model | y = α + xe^{ n } + ε |
${\sigma}_{n}^{2}$ | Multiplicative noise n is drawn from zero mean Gaussian distribution N (0, ${\sigma}_{n}^{2}$). |
${\sigma}_{\epsilon}^{2}$ | Additive independent noise ε is drawn from zero mean Gaussian distribution N (0, ${\sigma}_{\epsilon}^{2}$). |
μ_{ α }, ${\sigma}_{\alpha}^{2}$ | Background noise (bias) α is drawn from Gaussian distribution N (μ_{ α }, ${\sigma}_{\alpha}^{2}$). |
Hein EM [11]: | |
Model | PM_{ ijkp }~ N(S_{ ijkp }+ H_{ ijkp }, ${\tau}_{jk}^{2}$), MM_{ ijkp }~ N (φS_{ ijkp }+ H_{ ijkp }, ${\tau}_{jk}^{2}$), where PM refers to perfect match and MM to mismatch probe. |
a_{ k }, ${b}_{k}^{2}$ | True expression signal log(S_{ ijkp }+ 1) is drawn from truncated (realization always ≥ 0) Gaussian distribution TN (x, ${\sigma}_{ik}^{2}$), where variance ${\sigma}_{ik}^{2}$ is drawn from Gaussian distribution N (a_{ k }, ${b}_{k}^{2}$) and x is the underlying expression value. |
μ_{ λ }, ${\sigma}_{\lambda}^{2}$, α_{ η }, β_{ η } | Hybridization error term log(H_{ ijkp }+ 1) is drawn from truncated Gaussian distribution TN (λ_{ jk }, ${\eta}_{jk}^{2}$). Parameter λ_{ jk }is drawn from Gaussian distribution N (μ_{ λ }, ${\sigma}_{\lambda}^{2}$) and ${\eta}_{jk}^{2}$ is drawn from gamma distribution Γ^{-1}(α_{ η }, β_{ η }). |
α_{ τ }, β_{ τ } | Variance ${\tau}_{jk}^{2}$ is drawn from gamma distribution Γ^{-1}(α_{ τ }, β_{ τ }). |
φ | Fractional binding φ can be selected from interval [0, 1]. |
File input
Input data requirements. Requirements for the simulator input data used in microarray simulation.
Field | Description |
---|---|
data | Expression values or ratios measured for probes (genes). One value for each time instant per probe is required. |
time | Time instants when the expression values are obtained. |
genes | Names of the probes. |
spot | Location of each probe on the slide (x and y coordinate). |
name | Name of the dataset. |
type | Type of the input data i.e. cDNA or oligonucleotide expression or ratios. |
scale | Scale of the input data, i.e. log or linear scale. |
Biological and measurement noise
The most important part in the simulation of realistic microarray data is the modeling of biological and measurement technology specific errors, because they define the statistical characteristics of the simulated data. Biological errors are typically considered to include the internal stochastic noise of the cells and error sources related to sample preparation [16, 26]. This type of intrinsic noise is present in all measurements, regardless of the measurement technology. Measurement errors, on the other hand, include error sources that are directly related to the measurement technology and its limitations, for example bias due to the used dyes. The properties of this kind of extrinsic noise depend on the measurement technology [5].
In addition to the fact that the simulated ground truth data is free of measurement error, there is another major difference compared to real microarray data. Microarray data are usually measurements from cell populations. Thus the measured values are average expression values of all the cells in the population, while the simulated data essentially represents the behavior of a single cell. Furthermore, it is difficult to prepare a sample containing only one type of cells. Therefore, the measured data is typically from a heterogeneous cell population, for example from a mixture of different types of cells [27]. The simulated data can be made more realistic by introducing a population effect. This can be done by using a kernel function to spread the ideal expression patterns as proposed in [28]. The population effect blurs the simulated ground truth data so that not all the details can be observed. Small variations occurring only in some cells cannot be observed because they are masked by the large trends of the majority of the cells.
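One way to picture the population effect is as a smoothing of a single-cell time series with a kernel, repeated a given number of times. The sketch below is only an illustration under that assumption: the function name, the odd-length kernel with positive weights, and the renormalization at the series boundaries are choices made here, and the exact procedure of [28] and of the simulator may differ.

```python
def apply_population_effect(series, kernel, copies=1):
    """Blur a time series with a 1-D kernel, applied `copies` times."""
    half = len(kernel) // 2  # kernel is assumed to have odd length
    out = list(series)
    for _ in range(copies):
        smoothed = []
        for t in range(len(out)):
            acc, weight = 0.0, 0.0
            for k, w in enumerate(kernel):
                s = t + k - half
                # Ignore kernel taps that fall outside the series.
                if 0 <= s < len(out):
                    acc += w * out[s]
                    weight += w
            # Renormalize so that a constant signal stays constant at the edges.
            smoothed.append(acc / weight)
        out = smoothed
    return out
```

A sharp single-cell peak is flattened and widened by this operation, which mirrors the loss of detail described above.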
After the population effect has been taken into account, we can add biological and measurement errors to the simulated data. There have been numerous studies characterizing the properties of the error sources [5–11]. While the formulations of different error models are slightly different, the main components in all the models are the same. All of these models contain components that depend on the expression level and components that are independent of it. Thus, the errors are considered to be nonlinear in nature. Biological and measurement errors can be presented in the compact form
$$y = f(x) + e, \qquad (1)$$
where f is a nonlinear function, depending on the gene expression level x, e is an error term independent of gene expression level, and y is the observed expression value. Function f includes all error sources that are dependent on the true underlying biological gene expression level x. Thus, error term e and function f include both stochastic noise and systematic bias that originate from biological and measurement technology specific error sources.
To make it possible to estimate the parameters of the error models from real data, error terms are usually factorized into a more detailed form. Typically an error model includes separate terms for gene specific noise, measurement specific noise, array specific noise, biological sample specific noise, noise independent of all these, and so on [6, 7]. Some of the components model the intrinsic noise, that is, errors of biological origin, while other components represent the extrinsic noise, that is, errors from the microarray measurement technology. However, usually both of these error types are modeled together regardless of their origin.
As there are error sources that are gene, array and biological sample specific, there needs to be a way to implement all these in the model. In addition to these error sources, there may be technology specific details which have to be considered. Affymetrix type oligonucleotide arrays contain several probes that are a part of the same probe set and thus measure the same gene. Furthermore, perfect match (PM) and mismatch (MM) probes need to be handled independently in the error model [11]. These issues are taken into account in the simulation model design, and all these types of errors can easily be included. For details on how different types of error sources can be implemented, see the documentation available on the companion web page.
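The index structure of such a factorized model can be sketched with a simplified version of the hierarchical error model: a gene specific term g_i shared by all arrays, a chip specific term c_j shared by all genes, a gene and chip specific term r_ij, and independent noise ε. The function name is hypothetical, the arguments are standard deviations, and the biological sample specific term b_ijk is left out to keep the example short.

```python
import random

def hierarchical_error_model(x, n_arrays, sd_gene, sd_chip,
                             sd_gene_chip, sd_eps, seed=0):
    """Return y[i][j] = x[i] + g_i + c_j + r_ij + eps (log scale)."""
    rng = random.Random(seed)
    n_genes = len(x)
    g = [rng.gauss(0.0, sd_gene) for _ in range(n_genes)]   # one draw per gene
    c = [rng.gauss(0.0, sd_chip) for _ in range(n_arrays)]  # one draw per chip
    y = []
    for i in range(n_genes):
        row = []
        for j in range(n_arrays):
            r_ij = rng.gauss(0.0, sd_gene_chip)  # gene and chip specific
            eps = rng.gauss(0.0, sd_eps)         # independent noise
            row.append(x[i] + g[i] + c[j] + r_ij + eps)
        y.append(row)
    return y
```

Because `g` and `c` are drawn once and reused, every measurement of the same gene shares one offset and every measurement on the same chip shares another, which is what allows such terms to be estimated separately from real data.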
Our microarray simulation model includes several error models proposed in the literature [6–11]. Along with the models, methods for estimating model parameters from real measurement data have been proposed [7, 9, 11]. These methods can be used to estimate realistic parameters for the simulation. Some of the implemented error models are for oligonucleotide and some for cDNA data. Thus, to get statistically accurate results the right type of error model needs to be used together with the proper array type. The error models and their parameters are summarized in Table 5. After the error model and the population effect have been applied, the simulated data has realistic biological and statistical characteristics.
Slide manufacturing
To model a real microarray experiment, it is not enough to simulate the gene expressions and to apply the error model; the extraction of the data from slides has to be considered as well [13]. Thus we need to model the microarray manufacturing process.
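How the slide parameters could combine into a single spot can be sketched as follows. The snippet renders one spot as a small pixel patch, drawing the radius from a Gaussian distribution and, with a given probability, a uniform drift of the spot center. The function and argument names loosely mirror the slide parameters S_pix, S_μ, S_movprob and S_mov but are hypothetical, `s_sigma` is a standard deviation rather than a variance, and using the radius as the width of the Gaussian spot profile is an assumption of this example.

```python
import math
import random

def render_spot(intensity, s_pix=21, s_mu=6.0, s_sigma=0.5,
                s_movprob=0.2, s_mov=2.0, shape="gaussian", seed=0):
    """Return an s_pix x s_pix patch (list of rows) containing one spot."""
    rng = random.Random(seed)
    # Spot radius varies from spot to spot; keep it at least one pixel.
    radius = max(1.0, rng.gauss(s_mu, s_sigma))
    cx = cy = (s_pix - 1) / 2
    # Random drift of the spot from its designated location.
    if rng.random() < s_movprob:
        cx += rng.uniform(-s_mov, s_mov)
        cy += rng.uniform(-s_mov, s_mov)
    patch = []
    for py in range(s_pix):
        row = []
        for px in range(s_pix):
            d2 = (px - cx) ** 2 + (py - cy) ** 2
            if shape == "circle":
                row.append(intensity if d2 <= radius ** 2 else 0.0)
            else:
                row.append(intensity * math.exp(-d2 / (2 * radius ** 2)))
        patch.append(row)
    return patch
```

A full slide would place one such patch per probe on the subarray grid; print tip marks, chord cuts and subarray curving are not modeled in this sketch.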
Slide hybridization
While the most relevant of these errors may depend on the array type, the simulation model makes it possible to use the same error sources on both spotted two-channel and oligonucleotide based single-channel arrays. Introduction of these types of error sources might be of interest in validation of grid alignment and segmentation algorithms.
Slide scanning
In real experiments the hybridized slide is digitized by scanning. As a result a digital RGB image is obtained in which each color channel corresponds to the intensity information from different dyes. While the modern scanners are usually of high quality, they still have an effect on the obtained data, for example, in the form of the dynamic range. All scanners have a finite dynamic range, and thus some measurement values might saturate.
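The effect of a finite dynamic range can be sketched with a simple saturating quantizer. The function name and the linear mapping are assumptions for illustration; the arguments `r_b` and `r_th` are modeled on the scanner parameters R_b (number of bits) and R_th (saturation threshold), but the simulator's actual scanning step, including histogram equalization, is not reproduced here.

```python
def scan(values, r_b=16, r_th=1.0):
    """Saturate intensities at r_th and quantize them to integer codes."""
    max_code = 2 ** r_b - 1  # largest code among the 2**r_b levels
    codes = []
    for v in values:
        # Values below zero or above the threshold are clipped (saturated).
        clipped = min(max(v, 0.0), r_th)
        codes.append(round(clipped / r_th * max_code))
    return codes
```

Any two spots whose true intensities exceed `r_th` receive the same output code, so their difference can no longer be recovered from the scanned image.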
The scanner can also be a source of other types of errors. Because the slide is read by scanning each dye color separately, the channels may not align perfectly. Furthermore, it is not guaranteed that the slide is always scanned exactly straight. All these types of errors are included in the model.
Image reading
The final step in obtaining the realistic simulation data is to extract the expression values from the image. Because our simulation model produces images similar to real microarray slides, one can conveniently use any microarray feature extraction software.
We have, however, included an automatic grid alignment and image segmentation algorithm in the simulation model so that the data can be automatically extracted from images. These default algorithms can be easily replaced by other extraction algorithms.
Results and discussion
As the final application example, we present how the proposed simulation model can be used for comparing spot segmentation algorithms. Spot segmentation, along with procedures such as spot addressing and estimation of background and foreground levels, is one of the successive steps affecting the estimation of the true signal intensity. Simultaneous comparison of all the methods affecting the estimated true signal is a complex problem which would require more attention in order to be thoroughly studied. In our current example we estimate the spot and the background intensities by calculating the mean of segmented foreground and background pixels. Thereafter, the expression value is obtained by subtracting the background intensities from the foreground intensities. Our comparison example includes three different segmentation algorithms: the fixed circle (FC) method [34], the histogram segmentation (HST) method [35], and the seeded region growing (SRG) method [36].
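The intensity estimate described above can be sketched for the simplest of the three methods. The snippet treats all pixels inside a circle of fixed radius as foreground and the rest of the patch as background, then subtracts the mean background from the mean foreground. The function name is hypothetical, and placing the circle at the patch center assumes that grid alignment has already been done; the implementation in [34] may differ in detail.

```python
def fixed_circle_intensity(patch, radius):
    """Mean foreground minus mean background for a fixed circular mask."""
    h, w = len(patch), len(patch[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    foreground, background = [], []
    for y in range(h):
        for x in range(w):
            inside = (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
            (foreground if inside else background).append(patch[y][x])
    return sum(foreground) / len(foreground) - sum(background) / len(background)
```

Because the mask ignores the actual spot shape, a drifted or undersized spot mixes background pixels into the foreground mean, which is one plausible reason for a fixed mask to lose accuracy on degraded images.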
Segmentation results. Correlation coefficients between the estimated spot intensities and the input data. Histogram segmentation gives the highest correlation with the reference data. All methods give poorer correlations as the image quality is degraded.
Algorithm | Results for image 1 | Results for image 2 | Results for image 3 |
---|---|---|---|
FC | 0.9952 | 0.9112 | 0.8452 |
HST | 0.9962 | 0.9860 | 0.9432 |
SRG | 0.9876 | 0.9602 | 0.8680 |
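A comparison of this kind can be scored with a few lines of code once the estimated and the input intensities are available. The sketch below computes the Pearson correlation coefficient; the source does not state which correlation coefficient was used, so this choice and the function name are assumptions.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)
```

Calling `pearson(estimated, ground_truth)` for each segmentation method and each simulated image would produce a table of the same form as the one above.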
Conclusion
The previously proposed microarray simulation models have been suitable for specific simulation tasks only. The model we have proposed is modular and can be used in different kinds of analyses. One of the most important properties of the proposed model is the ability to use almost any kind of input data. Most models are limited to specific types of data, typically random data drawn from a predetermined distribution. Thus, they cannot exploit other data, such as data produced by network simulation. In addition, the proposed model utilizes several previously published error models in modeling the biological and measurement technology dependent variation. Thus, the model is not dependent on any specific formulation of noise characteristics, and the performance of the analysis algorithms can effectively be tested under different noise assumptions. Our model also supports both spotted two-channel and oligonucleotide based single-channel microarrays.
We have shown that the proposed model can be used to simulate microarray data which is valuable for validating various kinds of data analysis algorithms. As an example, the performance of microarray segmentation algorithms was compared under different noise conditions.
Declarations
Acknowledgements
This work was funded by the National Technology Agency of Finland and the Academy of Finland, project No. 213462 (Finnish Centre of Excellence program (2006–2011)). We want to thank Raija Lehto and Juho Lahti for implementing the grid alignment and image segmentation tools.
References
- Trotter MJ, Bruecks AK: Interpretation of skin biopsies by general pathologists: Diagnostic discrepancy rate measured by blinded review. Arch Pathol Lab Med 2003, 127(11):1489–1492.
- Nykter M, Hunt KK, Pollock RE, El-Naggar AK, Taylor E, Shmulevich I, Yli-Harja O, Zhang W: Unsupervised analysis uncovers changes in histopathologic diagnosis in supervised genomic studies. Technol Cancer Res Treat 2006, 5(2):177–182.
- Quackenbush J: Computational analysis of microarray data. Nat Rev Genet 2001, 2: 418–427. 10.1038/35076576
- Wierling CK, Steinfath M, Elge T, Schulze-Kremer S, Aanstad P, Clark M, Lehrach H, Herwig R: Simulation of DNA array hybridization experiments and evaluation of critical parameters during subsequent image and data analysis. BMC Bioinformatics 2002, 3: 29. 10.1186/1471-2105-3-29
- Tu Y, Stolovitzky G, Klein U: Quantitative noise analysis for gene expression microarray experiments. Proc Natl Acad Sci USA 2002, 99(22):14031–14036. 10.1073/pnas.222164199
- Cho H, Lee JK: Bayesian hierarchical error model for analysis of gene expression data. Bioinformatics 2004, 20(13):2016–2025. 10.1093/bioinformatics/bth192
- Dror RO, Murnick JG, Rinaldi NJ, Marinescu VD, Rifkin RM, Young RA: Bayesian estimation of transcript levels using a general model of array measurement noise. J Comput Biol 2003, 10(3–4):433–1452. 10.1089/10665270360688110View ArticlePubMedGoogle Scholar
- Rocke DM, Durbin B: A model for measurement error for gene expression array. J Comput Biol 2001, 8(6):557–569. 10.1089/106652701753307485View ArticlePubMedGoogle Scholar
- Hartemink AJ, Gifford DK, Jaakkola TS, Young RA: Maximum-likelihood estimation of optimal scaling factors for expression array normalization. In Proc. SPIE Microarrays: Optical Technologies and Informatics Edited by: Bittner ML, Chen Y, Dorsel AN, Dougherty ER. 2001, 4266: 132–140.View ArticleGoogle Scholar
- Nykter M, Aho T, Kesseli J, Yli-Harja O: On estimation of statistical characteristics of microarray data. Proc. Finnish Signal Processing symposium FINSIG 2003, Tampere, Finland 2003.Google Scholar
- Hein AMK, Richardson S, Causton HC, Ambler GK, Green PJ: BGX: A fully Bayesian integrated approach to the analysis of Affymetrix GeneChip data. Biostatistics 2005, 6(3):349–373. 10.1093/biostatistics/kxi016View ArticlePubMedGoogle Scholar
- Chen KC, Csikasz-Nagy A, Gyorffy B, Val J, Novak B, Tyson JJ: Kinetic analysis of a molecular model of the budding yeast cell cycle. Mol Biol Cell 2000, 11: 369–391.PubMed CentralView ArticlePubMedGoogle Scholar
- Balagurunathan Y, Wang N, Dougherty ER, Nguyen D, Chen Y, Bittner ML, Trent J, Carroll R: Noise factor analysis for cDNA microarrays. J Biomed Opt 2004, 9(4):663–678. 10.1117/1.1755232View ArticlePubMedGoogle Scholar
- Singhal S, Kyvernitis CG, Johnson SW, Kaisera LR, Liebman MN, Albelda SM: Microarray data simulator for improved selection of differentially expressed genes. Cancer Biol Ther 2003, 2(4):383–391.View ArticlePubMedGoogle Scholar
- Balagurunathan Y, Dougherty ER, Chen Y, Bittner ML, Trent JM: Simulation of cDNA microarrays via a parameterized random signal model. J Biomed Opt 2002, 7(3):507–523. 10.1117/1.1486246View ArticlePubMedGoogle Scholar
- Blake WJ, Kærn M, Cantor CR, Collins JJ: Noise in eukaryotic gene expression. Nature 2003, 422(6932):633–637. 10.1038/nature01546View ArticlePubMedGoogle Scholar
- Chen KC, Calzone L, Csikasz-Nagy A, Cross FR, Novak B, Tyson JJ: Integrative analysis of cell cycle control in budding yeast. Mol Biol Cell 2004, 15: 3841–3862. 10.1091/mbc.E03-11-0794PubMed CentralView ArticlePubMedGoogle Scholar
- Mendes P, Sha W, Ye K: Artificial gene networks for objective comparison of analysis algorithms. Bioinformatics 2003, 19(Suppl 2):ii122-iil29.View ArticlePubMedGoogle Scholar
- Simulation of microarray data with realistic characteristics companion web page[http://www.cs.tut.fi/sgn/csb/mamodel/]
- Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, Zeitlinger J, Jennings EG, Murray HL, Gordon DB, Ren B, Wyrick JJ, Tagne JB, Volkert TL, Fraenkel E, GifFord DK, Young RA: Transcriptional regulatory networks in saccharomyces cerevisiae . Science 2002, 298: 799–804. 10.1126/science.1075090View ArticlePubMedGoogle Scholar
- Mendes P: GEPASI: a software package for modelling the dynamics, steady states and control of biochemical and other systems. Comput Appl Biosci 1993, 9(5):563–571.PubMedGoogle Scholar
- Pettinen A, Aho T, Smolander OP, Manninen T, Saarinen A, Taattola KL, Yli-Harja O, Linne ML: Simulation tools for biochemical networks: Evaluation of performance and usability. Bioinformatics 2005, 21(3):357–363. 10.1093/bioinformatics/bti018View ArticlePubMedGoogle Scholar
- Kauffman SA: Metabolic stability and epigenesis in randomly constructed genetic nets. J Theor Biol 1969, 22: 437–467. 10.1016/0022-5193(69)90015-0View ArticlePubMedGoogle Scholar
- Huang S, Ingber DE: Shape-dependent control of cell growth, differentiation, and apoptosis: Switching between attractors in cell regulatory networks. Exp Cell Res 2000, 261: 91–103. 10.1006/excr.2000.5044View ArticlePubMedGoogle Scholar
- Shmulevich I, Yli-Harja O, Astola J: Inference of genetic regulatory networks under the best-fit extension paradigm. Proc. IEEE – EURASIP Workshop on Nonlinear Signal and Image Processing, Baltimore, Maryland 2001.Google Scholar
- Fraser HB, Hirsh AE, Giaever G, Kumm J, Eisen MB: Noise minimization in eukaryotic gene expression. PloS Biol 2004, 2(6):el37. 10.1371/journal.pbio.0020137View ArticleGoogle Scholar
- Lähdesmäki H, Shmulevich I, Dunmire V, Yli-Harja O, Zhang W: In silico microdissection of microarray data from heterogeneous cell populations. BMC Bioinformatics 2005, 6: 54. 10.1186/1471-2105-6-54PubMed CentralView ArticlePubMedGoogle Scholar
- Lähdesmäki H, Aho T, Huttunen H, Linne ML, Niemi J, Kesseli J, Pearson R, Yli-Harja O: Estimation and inversion of the effects of cell population asynchrony in gene expression time-series. Signal Process 2003, 83(4):835–858. 10.1016/S0165-1684(02)00471-1View ArticleGoogle Scholar
- Brändle N, Bishof H, Lapp H: A generic and robust DNA microarray image analysis. Mach Vision Appl 2003, 15: 11–28. 10.1007/s00138-002-0114-xView ArticleGoogle Scholar
- Ekstrøm CT, Bak S, Kristensen C, Rudemo M: Spot shape modelling and data transformations for microarrays. Bioinformatics 2004, 20(14):2270–2278. 10.1093/bioinformatics/bth237View ArticlePubMedGoogle Scholar
- Hughes TR, Roberts CJ, Dai H, Jones AR, Meyer MR, Slade D, Burchard J, Dow S, Ward TR, Kidd MJ, Friend SH, Marton MJ: Widespread aneuploidy revealed by DNA microarray expression profiling. Nat Genet 2000, 25: 333–337. 10.1038/77116View ArticlePubMedGoogle Scholar
- Bolstad BM, Irizarry RA, Åstrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19(2):185–193. 10.1093/bioinformatics/19.2.185View ArticlePubMedGoogle Scholar
- Affymetrix[http://www.affymetrix.com/]
- Scanalyze[http://rana.lbl.gov/EisenSoftware.htm]
- Yang YH, Buckley MJ, Speed TP: Analysis of cDNA microarray images. Brief Bioinform 2001, 2(4):341–349. 10.1093/bib/2.4.341View ArticlePubMedGoogle Scholar
- Yang YH, Buckley M, Dudoit S, Speed T: Comparison of methods for image analysis on cDNA microarray data. J Comput Graph Stat 2002, 11: 108–136. 10.1198/106186002317375640View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.