- Methodology article
- Open Access

# Three-parameter lognormal distribution ubiquitously found in cDNA microarray data and its application to parametric data treatment

- Tomokazu Konishi

**Received:** 19 September 2003

**Accepted:** 13 January 2004

**Published:** 13 January 2004

The Erratum to this article has been published in BMC Bioinformatics 2004 5:82

## Abstract

### Background

To cancel experimental variations, microarray data must be normalized prior to analysis. Where an appropriate model for statistical data distribution is available, a parametric method can normalize a group of data sets that have common distributions. Although such models have been proposed for microarray data, they have not always fit the distribution of real data and thus have been inappropriate for normalization. Consequently, microarray data in most cases have been normalized with non-parametric methods that adjust data in a pair-wise manner. However, data analysis and the integration of resultant knowledge among experiments have been difficult, since such normalization concepts lack a universal standard.

### Results

A three-parameter lognormal distribution model was tested on over 300 sets of microarray data. The model treats the hybridization background, which is difficult to identify from images of hybridization, as one of the parameters. A rigorous coincidence of the model to data sets was found, proving the model's appropriateness for microarray data. In fact, a closer fitting to Northern analysis was obtained. The model showed inconsistency only at very strong or weak data intensities. Measurement of *z*-scores as well as calculated ratios was reproducible only among data in the model-consistent intensity range; also, the ratios were independent of signal intensity at the corresponding range.

### Conclusion

The model could provide a universal standard for data, simplifying data analysis and knowledge integration. It was deduced that the ranges of inconsistency were caused by experimental errors or additive noise in the data; therefore, excluding the data corresponding to those marginal ranges will prevent misleading analytical conclusions.

## Background

Since microarray data contain systematic variations that are derived from various experimental sources, the data should be normalized prior to comparison with other such data. In order to perform such normalization, some stable data characters that represent the data set are found and/or assumed. By making such characters identical, each data set is adjusted to other data sets, to a reference experiment's data, or to a mathematical model. A normalization method is based on ideas or concepts in which elements of data are considered to be the stable characters, and on the design of calculations regarding how data sets are to be adjusted. It is clear that these concepts affect the normalization results; such concepts behind the normalization are often closely connected with the evaluation of differences in data. Indeed, these concepts should originate from experimental observations and/or biologically appropriate assumptions. As an introduction, it might be helpful to describe the concepts on which previously reported methods for microarray data normalization have been based.

Taking ratios with stable elements of the data is one of the simplest methods by which a set of relative data has been normalized. Candidates for such stable elements can be data for house-keeping gene(s), data for control experiments or the median of a set of data. Within a group of data that share this stable element, the calculated ratio can be compared. Such ratio-based methods have been frequently used in the field of molecular biology, since many of the determination methods of mRNA produce relative values. The relative nature is also expected in microarray data. One of the pioneer works in the statistical treatment of microarray data followed the ratio-based scheme [1], assuming a rigid distribution model for the ratios, and allowing objective decisions by seeking the ratio data that exceeded a cut-off value for their deviation. However, it has become clear that such calculated ratios often fluctuate depending on the signal intensity [2–7]. This unstable character, or intensity-dependent effect of measured log-ratio, disagrees with the original assumption, and becomes problematic in data analyses.

Tseng et al. and Yang et al. [2, 3] followed the ratio-based scheme but stabilized the log-ratios by compensating them using the LOWESS technique. In addition, Workman developed a method that used a different calculation technique [4]. Due to the flexibility of their non-linear compensations, the methods can adjust any pair of data sets by resolving the fluctuations. However, even such strong methods could not achieve the assumed stability of the model in regard to log-ratio deviations, by which differences in gene expressions are measured [1]; rather, determined ratios were dependent on signal intensity.

In order to solve the intensity-dependence problem in determined ratios, Huber et al. and Durbin et al. [5, 6] produced a new scheme that recognizes the differences in mRNA levels, not in terms of ratios but in terms of the difference values in arsinh functions, which have stability in their statistical behaviors. The adjustment is performed by linear transformation of one of the pair-wise data sets prior to the arsinh conversion, by trial and improvement, evaluated by likelihood analysis in arsinh values [5]. In fact, this method adjusts the data over the entire range of determination. However, since microarray data might have additive noise [5, 8], which will affect data especially at lower intensities, the stability of deviations at all signal intensities is still doubtful. Additionally, the processed results are not comparable with those of authentic analyses, such as Northern and/or RNase protection assays, since the arsinh function is incompatible with ratios.

An alternative attempt involves the adjustment of a group of data together at one time. Kerr et al. [9] assumed an ANOVA model, according to which data logarithms are linearly adjusted to have the least differences in relation to each other. Since this method treats data sets simultaneously, it finds the most suitable solution among the data sets. However, this method cannot self-evaluate the appropriateness of the model and the design of the adjusting process. Additionally, this process requires an inordinate amount of calculation and, if a set of data is added afterwards, all the calculations have to be performed from the beginning.

In order to reduce the amount of calculation, it is preferable to adjust each data set in terms of a rigid model. Such a method can normalize data sets one by one, and the normalized data can be compared directly with each other without further adjustment. Additionally, if the model is appropriate, normalization will be highly accurate. In such model-based normalization methods, the simplest assumed stability might be the total amount of mRNA in a tissue, which forms the basis of the normalization of the total intensity [7, 10] or the global standardization [11, 12]. However, the appropriateness of the model in terms of the stability, not of the total amount of mRNA but of the sum of the determined numerals, is difficult to evaluate; the determined data are not always proportional to the amount of mRNA, since the determined data contain background, the value of which is difficult to estimate exactly [13]. Some alternative methods assume a model for the statistical distribution of data. In many cases, a lognormal distribution would be the optimal model for microarray data, and indeed this distribution has been reported for some data sets [2, 14, 15]. Additionally, Hoyle et al. [16] have found that microarray data are in agreement with both Benford's law and Zipf's law, and suggested the lognormal model and power law model to be good candidates for assumptions concerning the distribution. However, the real data distributions sometimes do not fit closely to these models [9, 16]. Such a poor fit of a model shows up as skew in data histograms or probability plots.

The intensity data of microarray experiments always contain a certain level of background [8], and inadequate estimation of this background can affect the assumed stabilities in ratio-based normalization methods as well as lognormal data distributions. Background has been estimated based on the hybridization image [13], which is then subtracted from intensity data; this estimation technique is based on the supposition that the background level is consistent between the DNA spot and the surrounding space. However, because surface properties may differ between the DNA spot and the surrounding space, the respective backgrounds can also differ (a possible extreme example of such difference is the antiprobe [17] with dark DNA spots against a bright surrounding area). Indeed, failure in background estimation will give rise to an intensity dependency of the calculated log-ratio; such an effect can even be seen in simulation data [18]. Additionally, an under-estimation of background in both data sets will reduce the differences at lower signal intensities; such a phenomenon has been observed in determinations evaluating a microarray's mechanical characteristics [19]. Adding or subtracting a constant to or from a series of numerals affects the logarithm values in a non-linear way, and biased errors in background estimation can affect the distribution of microarray data in the same manner.
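The non-linear effect of a background error on logarithms can be illustrated with a small numerical sketch. All numbers below (the signal values, the 2-fold change, and the residual background of 50 units) are hypothetical and chosen only for illustration; they are not taken from the data sets analyzed in this article.

```python
import numpy as np

# Hypothetical true signals for one channel, from faint to strong spots.
true_signal = np.array([20.0, 100.0, 1000.0, 10000.0])

# Second channel: every gene is expressed at exactly twice the level.
true_ratio = 2.0
other_signal = true_ratio * true_signal

# Hypothetical residual background left in both channels because the
# background was under-estimated before subtraction.
background_error = 50.0

observed_log_ratio = np.log2(
    (other_signal + background_error) / (true_signal + background_error)
)

# The true log2 ratio is 1 for every spot; the observed value depends on intensity.
for s, lr in zip(true_signal, observed_log_ratio):
    print(f"true signal {s:8.0f}: observed log2 ratio {lr:.3f}")
```

In this sketch the observed log-ratio is compressed toward zero for faint spots and approaches the true value only at high intensities, which is the intensity dependency described above.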

In this article, a model-based normalization method that finds the background by calculation is introduced. The method assumes stability in data distribution of each set of single-channel microarray data. The method uses a three-parameter lognormal distribution model; since the image-based local background estimation [13] can generate a constant error deriving from the different surface properties on a DNA chip, it is reasonable to handle the background as an unknown quantity. The three-parameter model was established by introducing the unknown as the threshold parameter to a lognormal distribution model [15]. To maintain the objectivity of data treatment, the parameter is restricted as a common constant within a single-channel data set; this treatment is based on assumptions that the background is mainly provoked by non-specific binding of pigments to DNA spots, and that each DNA spot binds a fixed amount of the pigments. The common constant is found as the value that, when universally subtracted, produces a data distribution that most closely fits the model. The appropriateness of the assumed distribution model is evaluated by means of coincidence between the model and the resulting data distribution in many different microarray experiments. Additionally, a ratio-based treatment of the normalized data is introduced. The stability of signal versus ratio relationship is shown below, as well as a correlation with Northern blot analyses.

## Results

### Fitness of the three-parameter model to data

*A. thaliana*, *B. subtilis*, *C. elegans*, and *E. coli* (some examples are shown in Fig. 2), as well as in commercially available DNA chips such as Atlas Glass Array (Clontech), and synthetic oligonucleotide probe chips such as GeneChip (Affymetrix) and Agilent Oligo microarray (data not shown).

During the threshold parameter estimation process, the parameter became larger than some of the intensity data, producing negative values whose logarithms could not be calculated. In the experiment shown in Figure 1, for example, 1.2% of the data fell into this class, although numbers of such data were variable between experiments. Since microarray data might contain a certain level of additive noise [8], it is highly possible that some of the DNA spots produce signals so faint that the negative noise can mask them completely. Consequently, those data were simply treated as "signals not detected".

### Appropriateness of the three-parameter model evaluated from technical reproducibility

Although microarray data were consistent with the distribution model over a wide range of intensities, no data showed perfect consistency. Rather, the probability plots necessarily bent downwards at lower intensities (Figs. 1, 2). Such discrepancies could be caused either by additive noise in the measurement system or by the unsuitability of the model. These possibilities were verified through data reproducibility, which was confirmed on a scattergram with repeated hybridization experiments involving the same RNA sample. Since the breakdown of the model occurs at different intensities, each data set is consistent with the model at a different range of signals. Naturally, we cannot expect data reproducibility at the inconsistent ranges where the data do not obey the model. In such ranges, if the inconsistency occurs because of model unsuitability, the normalization may give the data a biased error; such bias will produce one or more bends in the scattergrams between repeated experiments. Alternatively, if the inconsistency is caused by noise, the noise will affect the reproducibility by dispersing the scattergram.

The scattergrams for repeated experiments showed consistent data reproducibility (Fig. 3). Most of the data were plotted within the 1.4-fold difference lines (the method for calculating the ratio is described below) above the breakdown levels of the model. In contrast, in the discrepancy region (red dots), the reproducibility was lost and the data became randomly plotted (Fig. 3), suggesting that the discrepancies between data and the model were caused by noise rather than unsuitability of the model. In some cases, the upper part of the data also bent down below the y = x line in the probability plots (Fig. 2, ID 1593 and 15973). Such a breakdown was typically found in cases in which the signals of the intensity data were relatively large. Such breakdowns may be caused by the saturation of array scanners, which may ruin the signal response. Since the diameters of the DNA spots and also the DNA concentrations within each spot are uneven, such saturation will also add a level of noise to the data, rather than just creating distortion in the linear signal response. Actually, the plotted data above the upper limit of linearity showed inconsistent reproducibility (Fig. 3, panel b). There were no bends observed in the scattergrams, showing the appropriateness of the three-parameter model, and contradicting the dye-specific alterations to data (panels c and d).

### Stable nature of σ values found from logarithms of γ subtracted microarray data

The stability of shape parameter σ values in human fibroblast data.

experiment | σ | SD | n |
---|---|---|---|
control (ch1) | 0.65 | 0.08 | 24 |
grid 1 | 0.61 | 0.08 | 6 |
grid 2 | 0.66 | 0.06 | 6 |
grid 3 | 0.65 | 0.09 | 6 |
grid 4 | 0.69 | 0.08 | 6 |
treatment (ch2) | 0.65 | 0.06 | 24 |
grid 1 | 0.69 | 0.06 | 6 |
grid 2 | 0.68 | 0.04 | 6 |
grid 3 | 0.65 | 0.06 | 6 |
grid 4 | 0.60 | 0.04 | 6 |

### Comparison of the normalized data on a ratio basis

The obtained results demonstrate that we can expect lognormal distributions in microarray data. Within the range in which data obey the distribution model, logarithms of the data can be normalized to *z*-scores. How, then, can we evaluate the change of expression levels presented in *z*-scores? It will, of course, be useful if the normalized data can also be compared to results obtained by conventional methods, such as Northern analysis. In most conventional analyses, ratios are used to indicate differences in expression levels. Since such analyses do not provide information about the distribution of expression levels, the *z*-scores cannot be calculated. In order to normalize the data, the amounts of total RNA, rRNA, and/or housekeeping genes are commonly used as standards instead. Under such limitations, ratio methods are a convenient choice for evaluating the differences in gene transcript levels, i.e. the number of mRNA molecules transcribed from a gene and accumulated in a cell.

Assuming stability in the population distribution or transcript levels of genes, ratios can be calculated from *z*-scores obtained from microarray experiments according to the following formula. Since background-subtracted microarray data may have a linear relationship to the transcript level, each datum can be expressed as

$$(\text{datum for the } i\text{th spot at the } j\text{th hybridization}) = a_j b_i x_{j,i},$$

where $a_j$ is a factor that compensates for differences in sensitivities of detection between hybridization experiments, $b_i$ is another sensitivity compensation factor between different DNA spots on a DNA chip, and $x_{j,i}$ is the transcript level of a gene. Consider two sets of background-subtracted data, $a_1 b_i x_{1,i}$ and $a_2 b_i x_{2,i}$ for $i = 1 \ldots n$, in different hybridizations on identical array chips. Since the same normal distribution is assumed for $\log(x_{1,i})$ and $\log(x_{2,i})$, the values for the shape parameter ($\sigma$) are the same. According to *z*-normalization of the data, the normalized data will be

$$Z_{j,i} = \{\log(a_j b_i x_{j,i}) - \mu_j\}/\sigma,$$

where $\mu_j$ is the observed scale parameter for each hybridization experiment. The difference in the normalized data between $Z_{1,i}$ and $Z_{2,i}$ can be presented as

$$Z_{1,i} - Z_{2,i} = \{\log(a_1 b_i x_{1,i}) - \mu_1\}/\sigma - \{\log(a_2 b_i x_{2,i}) - \mu_2\}/\sigma$$

$$= \{\log(x_{1,i}/x_{2,i}) + \log(b_i) - \log(b_i) + \log(a_1) - \log(a_2) - (\mu_1 - \mu_2)\}/\sigma. \tag{1}$$

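In this formula, both $\mu_1$ and $\mu_2$ are the scale parameters, which can be defined as

$$\mu_j = \frac{1}{n}\sum_{i=1}^{n}\log(a_j b_i x_{j,i}) = \log(a_j) + \frac{1}{n}\sum_{i=1}^{n}\log(b_i) + \frac{1}{n}\sum_{i=1}^{n}\log(x_{j,i}).$$

Because the distribution of $x_{j,i}$ is assumed to be stable, the average of $\log(x_{j,i})$, $\frac{1}{n}\sum_{i=1}^{n}\log(x_{j,i})$, will be common between the experiments. Here we can express $\mu_1 - \mu_2$ appearing in formula (1) as

$$\mu_1 - \mu_2 = \log(a_1) - \log(a_2).$$

This leads to the difference of normalized data (1)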

$$Z_{1,i} - Z_{2,i} = \{\log(x_{1,i}/x_{2,i})\}/\sigma.$$

From this formula, the abundance ratio of RNA, $x_{1,i}/x_{2,i}$, can be found from the difference of *z*-scores as

$$x_{1,i}/x_{2,i} = 10^{\sigma (Z_{1,i} - Z_{2,i})}.$$
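The cancellation of the sensitivity factors in this derivation can be checked numerically. The following is a minimal sketch, not the implementation used in this study: the function name `ratio_from_z`, the simulated transcript levels, and the values of the factors are all hypothetical, base-10 logarithms are assumed from the final formula, and the arithmetic mean and standard deviation are used here in place of the robust estimates described in Methods.

```python
import numpy as np

def ratio_from_z(z1, z2, sigma):
    """Abundance ratio x1/x2 from two z-scores, assuming base-10 logarithms."""
    return 10.0 ** (sigma * (np.asarray(z1) - np.asarray(z2)))

rng = np.random.default_rng(0)
n = 2000
x1 = 10.0 ** rng.normal(2.0, 0.6, n)   # hypothetical transcript levels, sample 1
x2 = rng.permutation(x1)               # same distribution, but each gene differs
b = 10.0 ** rng.normal(0.0, 0.1, n)    # hypothetical spot-specific factors b_i
a1, a2 = 3.0, 0.5                      # hypothetical hybridization factors a_j

log_d1 = np.log10(a1 * b * x1)         # background-subtracted data, hybridization 1
log_d2 = np.log10(a2 * b * x2)         # background-subtracted data, hybridization 2

# The derivation assumes a common sigma; here the value from hybridization 1
# is used for both data sets.
sigma = log_d1.std()
z1 = (log_d1 - log_d1.mean()) / sigma
z2 = (log_d2 - log_d2.mean()) / sigma

recovered = ratio_from_z(z1, z2, sigma)
print(np.allclose(recovered, x1 / x2))
```

Because the second sample is a permutation of the first, the two sets share exactly the same distribution of transcript levels, which is the stability assumption that the derivation relies on.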

### Data comparison with Northern analyses

The scattergram showed a close correlation between the results of microarray and Northern blot analyses (panel **a**), suggesting that the proposed treatment of the microarray data provides an appropriate method for analyzing the data. For reference, the same comparison using other normalization methods is shown in panels **b** and **c**, providing more dispersed results with different tendencies. The coincidence shows the appropriateness of the presented data treatments, since the background subtraction is critical to the ratio calculation. If the subtraction were inaccurate, the scattergrams would never coincide.

### Stability in signal intensity versus ratio relationships

## Discussion

The data distributions found in the public resources [20] and rice cDNA microarray [15] demonstrate the appropriateness of the three-parameter lognormal model for microarray data distribution. All the probability plots show wide ranges of coincidence between the normalized data and the model (Fig. 2). Small classes of data, at the largest and smallest intensities, are inconsistent with the model, but these appear to be due to measurement errors rather than the inappropriateness of the method. If this assumption holds true, we can expect lognormal distributions in the transcript levels of genes, i.e. the number of mRNA molecules transcribed from a gene and accumulated in a cell. Since microarrays can be considered as measurements of random samples of the transcript levels, and the population has the same form of distribution as its random samples, the transcript levels must be lognormal. This may be a common nature of cells, since this distribution is found ubiquitously in many experiments on different DNA chips.

The assumed stability in the distribution of transcript levels, which was the basis of the conversion of *z*-scores to ratios, may represent the state of real cells in a sample. Stabilities of the lognormal distribution, the expected distribution of transcript levels, can be observed from those of the two parameters, σ and μ. Since the parameter σ may not be affected by experimental conditions, the value for the population should be the same as that determined from microarray data. The stability of the parameter σ is observed clearly in data (Table 1). Unfortunately, the stability of μ cannot be confirmed from microarray data, which has a relative nature; experimental conditions will affect μ. The relativity is derived not only from the signal detection method, but also from the RNA sample preparation process. However, since σ is stable, changes in μ mean that most of the genes change their expression levels down or up simultaneously. Such synchronous decrease or increase of materials may rarely occur in cells, which otherwise show homeostatic natures.

In a pair of normalized data sets, the ratios calculated by means of the three-parameter model were distributed approximately lognormally and the distribution was found to be stable in relation to the signal intensities (Fig. 5). Since each normalization method is based on different assumptions, each of which reflects the criteria used to evaluate the intensity of data or difference between data, different methods can lead to different conclusions. Interestingly, the distribution of log-ratios fortuitously satisfies the assumptions that are used in other normalization methods. For instance, where only a limited number of gene expression levels are changed and they are well-behaved [25], stability of signal versus log-ratio [1–4] and intensity independence of measured log-ratio [1, 5–7] can be expected. Since these assumptions have been adopted from a biological point of view, the stable distributions may be an *a priori* characteristic of the differences in transcript levels. Confirming such a characteristic in real data might be another means of verifying the appropriateness of the parametric normalization method: finding the real background as well as the center and deviation of data distribution, and managing the expressional changes. Certainly, such stabilities will be of great value in data mining on a ratio-basis [1], since a fixed threshold value can be used to select affected genes in sets of experiments. Furthermore, such stability will help in designing comparisons of data [23] by reducing the possibility that different designs will lead to very different conclusions.

In normalization of microarray data, treatment of data at lower intensities can seriously affect the calculation results. According to the expected lognormality in signal distributions, the range of signal data necessarily becomes quite wide, and this characteristic complicates exact measurements at the lower and higher intensity ends of the signal. Additionally, unlike errors that are caused by signal saturation, which can be resolved simply by re-scanning the DNA chip at lower excitation intensity, the additive noise is difficult to cancel or reduce. Such additive noise will critically damage faint signals. If such tainted data is included in the normalization process, the additive noise can affect the entire data set. In order to avoid such effects, the choice of a robust calculation method will be important. For example, the parameter calculation used in this article uses data only within the interquartile range. Of course, in cases in which the additive noise becomes comparable to the lower quartile, even this method will become noise-sensitive.
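The benefit of restricting the parameter calculation to the interquartile range can be illustrated with a short simulation. This is only a sketch under stated assumptions: the lognormal parameters, the 5% fraction of damaged values, and the form of the damage are hypothetical, and the constant 1.349 (the interquartile range of a standard normal distribution) is the conventional scaling, not a value quoted from this article.

```python
import numpy as np

rng = np.random.default_rng(2)
true_mu, true_sigma = 2.0, 0.6             # hypothetical log10-scale parameters
logs = rng.normal(true_mu, true_sigma, 5000)

# Damage the faintest 5% of the values to mimic noise-tainted weak signals.
corrupted = np.sort(logs)
k = corrupted.size // 20
corrupted[:k] -= rng.exponential(1.5, k)   # hypothetical downward scatter

# Moment estimates use every value, including the damaged tail.
mean_est = corrupted.mean()
sd_est = corrupted.std(ddof=1)

# Robust estimates use only the quartiles, which the damaged tail does not reach.
q1, med, q3 = np.percentile(corrupted, [25, 50, 75])
robust_mu = med
robust_sigma = (q3 - q1) / 1.349

print(f"mean/SD:    mu = {mean_est:.3f}, sigma = {sd_est:.3f}")
print(f"median/IQR: mu = {robust_mu:.3f}, sigma = {robust_sigma:.3f}")
```

As noted in the text, this protection holds only while the damaged values stay below the lower quartile; if the noise reached the quartiles, the robust estimates would also be affected.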

The range of data that are inconsistent with the distribution model should be canceled prior to further bioinformatic analyses, since such data may contain additive noise at a level that seriously affects the signal. Generation of such a data class can be simulated using simple calculations, for example, addition of random numbers to an ideal series that is lognormally distributed. Such noise numbers will create a bend in the probability plot for the resulting series (data not shown). Indeed, in the inconsistent signal range of data, determined *z*-scores showed low reproducibility in repeated hybridizations (Fig. 3) and in the calculated ratios in the dye-swap experiment (Fig. 8). The low reproducibility is not derived from the parametric normalization method, since the corresponding range of data normalized by LOWESS also lacks reproducibility (Fig. 7). Cancellation of such data classes by the model does not mean sacrificing a range of measurement; rather, it can prevent a waste of labor, which is often initiated by noise in data.
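The simulation mentioned above can be sketched as follows. This is an illustration only, not the calculation performed for this article: the lognormal parameters, the noise level of 10 units, the plotting position (i + 0.5)/n, and the 1.349 scaling of the interquartile range are all assumptions made for the example.

```python
from statistics import NormalDist
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Ideal lognormal series on a base-10 scale (hypothetical mu = 2.0, sigma = 0.6).
ideal = 10.0 ** rng.normal(2.0, 0.6, n)

# Additive, intensity-independent noise with a hypothetical SD of 10 units.
observed = np.sort(ideal + rng.normal(0.0, 10.0, n))

# Robust parameters taken from the quartiles of the noisy series.
q1, med, q3 = np.percentile(observed, [25, 50, 75])
mu = np.log10(med)
sigma = (np.log10(q3) - np.log10(q1)) / 1.349

# Theoretical normal quantile for every rank of the sorted series.
std_normal = NormalDist()
theoretical = np.array([std_normal.inv_cdf((i + 0.5) / n) for i in range(n)])

# Values at or below zero have no logarithm ("signals not detected").
detected = observed > 0
z_obs = (np.log10(observed[detected]) - mu) / sigma
deviation = z_obs - theoretical[detected]

lower_tail = theoretical[detected] < -1.5
centre = np.abs(theoretical[detected]) < 0.67
print("mean deviation, lower tail:", deviation[lower_tail].mean())
print("mean deviation, centre:    ", deviation[centre].mean())
```

A negative mean deviation in the lower tail corresponds to a downward bend of the probability plot at low intensities, while the central part of the series stays close to the y = x line.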

Many experimental errors are possible sources for additive noise; however, in my experience, critical ones that can affect a large part of data are derived from insufficient signal intensity or uneven hybridization. Shortage in the amount and/or inadequate quality of RNA would lead to the former problem. Unfortunately, re-scanning of hybridized chips with higher excitation power rarely changes the signal/noise ratios; it might expand the noise levels as well as the signal level. Unevenness of hybridization might be compensated with various calculation methods. However, such compensations require information on the differences and/or similarities in the unevenness between the background and the signal; for example, if the unevenness occurs on the backgrounds at the same rate as on the signals, compensation should be performed before the γ subtraction. In contrast, the parametric normalization method does not determine the level of multiplicative noise from a set of data. The noise can occur from variation of DNA amount in each spot, and this would cause errors in the determination of expressional changes. The error can be cancelled in multi-color comparisons within the same chip's data; however, it will appear in inter-chip comparisons of data.

## Conclusions

A close fit was found ubiquitously between the three-parameter model and real data. The coincidence was stable across biological treatments of subjects. Such commonness and stability in the manner of distribution can be explained without inconsistency if these features are *a priori* characteristics of a living cell.

Using the distribution model, data were successfully handled parametrically. The calculation methods for the data ratios as well as for normalization were introduced. Some characteristics found in the normalized data and in the results obtained from the analysis showed improved data handling in the following categories:

- Advances in data accuracy and reliability. Normalized data and calculated ratios showed high levels of experimental reproducibility. Moreover, it was shown that the normalization method could identify the noise-affected ranges of data intensity, allowing for the exclusion of affected data prior to detailed analysis. Calculated ratios and their determination reproducibility were independent of signal intensity.
- Expansion of the groups of experiments and of measurement methods that can compare data. The commonness of data distribution suggests that the model-based method may be applicable to a wide range of experiments. At least, the removal of the need for special reference RNA hybridization means that data comparisons are no longer restricted. Additionally, differences between the normalized data can be translated to a ratio basis. Indeed, the calculated ratios correlated closely with those of Northern analysis. It became possible to compare and integrate the ratio-basis results among experiments and/or with other measurement methods.

Comparative table of normalization methods

| | 3-parameter lognormal method [this work] | LOWESS [2, 3] | house-keeping genes | globalization [7] | ANOVA [9] | |
|---|---|---|---|---|---|---|
| assumed stable character | statistical data distribution | constant ratio tendency | expression levels of particular genes | sum of signal data | smallest differences in log(data) | smallest differences in arsinh(data) |
| Can the assumption be verified? | | no | yes | no | no | no |
| units for expression level | | ratios to a reference | ratios to the stable genes | fraction (ppm) | ratios to a reference | statistical differences to a reference |
| data transformation for the adjustment | subtraction of a constant | non-linear | no | no | linear on logarithms | linear |
| numbers of data sets normalized in a calculation | 1 | 2 | 1 | 1 | all the sets to be compared | 2 |
| Can it compare multiple data sets without reference RNA? | yes | no | yes | yes | yes | no |
| amount of calculation | medium | medium | least | least | vast | large-medium |
| reproducibility | | no (Fig. 7) | nt | no [7] | no | no (Fig. 6cd) |
| Is the ratio tendency independent of signal intensity? | | yes (Fig. 6b) | nt | no (Fig. 6a) | no | yes (Fig. 6cd) |
| Is the ratio variance independent of signal intensity? | | | nt | no (Fig. 6a) | no | yes (Fig. 6cd) |
| Can it find the level of additive noise in the data? | yes | no | no | no | no | no |

## Methods

### Data resources

Data used in this article were obtained from open resources at Stanford University [20] or from experiments using rice seedlings [15].

### The lognormal distribution model and estimation of the parameters

The method assumes that the original intensity data, $r_i$ for $i = 1, 2, \ldots, n$, obey a lognormal distribution. The probability density function of the intensity data used was:

$$f(r_i) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{\{\log(r_i-\gamma)-\mu\}^{2}}{2\sigma^{2}}\right] \quad \text{for } r_i > \gamma,$$

where σ, μ and γ are the shape, scale and threshold parameters, respectively.

The parameter γ was found through a trial-and-improvement calculation process; in each trial, the distribution of $\log(r_i-\gamma)$ was checked by normal probability plotting [26], and the value that gave the best fit to the model was selected for γ. The fitness was evaluated by the sum of absolute differences between the model and $\log(r_i-\gamma)$, within the interquartile range of data. The parameter μ was found as the median of $\log(r_i-\gamma)$, and the parameter σ was found from the interquartile range of $\log(r_i-\gamma)$; these are known as robust alternatives for the arithmetic mean and standard deviation, respectively. Parameters μ and σ were found for each data grid, a group of data for DNA spots that were printed by an identical pin, in order to avoid divergences caused by pin-based differences [27]. *Z*-normalization was carried out for each datum as

$$Z_{r_i} = \{\log(r_i-\gamma)-\mu\}/\sigma.$$

Intensity data ($r_i$) less than γ were treated as "data not detected", since such data might contain negative noise larger than the signal (see Results).
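The estimation procedure described above can be sketched in code. This is a minimal illustration under stated assumptions, not the implementation used for this article: the trial-and-improvement search is replaced by a plain grid search over a hypothetical interval, base-10 logarithms are assumed, the plotting position (i + 0.5)/n and the 1.349 scaling of the interquartile range are conventional choices, and the function names `fit_three_parameter_lognormal` and `z_normalize` are invented for the example. The per-grid estimation of μ and σ is omitted.

```python
from statistics import NormalDist
import numpy as np

IQR_OF_STD_NORMAL = 1.349  # interquartile range of N(0, 1)

def fit_three_parameter_lognormal(r, n_candidates=400):
    """Grid-search gamma; return (gamma, mu, sigma) on a base-10 log scale."""
    r = np.sort(np.asarray(r, dtype=float))
    n = r.size
    std_normal = NormalDist()
    # Theoretical normal quantile for each rank (plotting position is an assumption).
    theoretical = np.array([std_normal.inv_cdf((i + 0.5) / n) for i in range(n)])
    q1, med, q3 = np.percentile(r, [25, 50, 75])
    inside = (r >= q1) & (r <= q3)  # the fit is judged within the interquartile range
    # Hypothetical search interval: from one IQR below the lower quartile up to,
    # but not including, the lower quartile itself.
    candidates = np.linspace(q1 - (q3 - q1), q1, n_candidates, endpoint=False)
    best = None
    for gamma in candidates:
        mu = np.log10(med - gamma)  # median of log(r - gamma)
        sigma = (np.log10(q3 - gamma) - np.log10(q1 - gamma)) / IQR_OF_STD_NORMAL
        z = (np.log10(r[inside] - gamma) - mu) / sigma
        misfit = np.abs(z - theoretical[inside]).sum()  # sum of absolute differences
        if best is None or misfit < best[0]:
            best = (misfit, gamma, mu, sigma)
    return best[1:]

def z_normalize(r, gamma, mu, sigma):
    """Z-normalize intensities; values not above gamma become NaN (not detected)."""
    r = np.asarray(r, dtype=float)
    z = np.full(r.shape, np.nan)
    ok = r > gamma
    z[ok] = (np.log10(r[ok] - gamma) - mu) / sigma
    return z

# Noise-free synthetic check with hypothetical parameters gamma=50, mu=2, sigma=0.6.
n = 2000
quantiles = np.array([NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)])
r = 50.0 + 10.0 ** (2.0 + 0.6 * quantiles)
gamma, mu, sigma = fit_three_parameter_lognormal(r)
print(f"gamma = {gamma:.1f}, mu = {mu:.3f}, sigma = {sigma:.3f}")
```

The synthetic series is built directly from exact lognormal quantiles, so the search should return values close to the parameters used to generate it; real data would additionally contain sampling variation and noise, and the search interval would need to be chosen to suit the scanner's intensity scale.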

### Northern analyses

RNA samples were obtained from a time course experiment on rice seedlings exposed to cold-stress [15]. During the time-course experiment, 7 clones were randomly selected from those showing a higher magnitude of increase or decrease (more than 1.5-fold) in microarray experiments that were normalized with globalization, and 7 clones were selected in a totally random manner. For those clones, Northern blotting analyses were performed with the same RNA batch that was used for probing microarray experiments. Radioactivity of detected bands on probed membranes was measured using the BAS system (Fuji). For each band on an image, the signal intensity was detected as the sum of signal values in pixels within the band. The background was estimated based on the average intensities of the electrophoresis lane but excluding the band itself. The relative signal of a band was calculated by subtracting the background from the intensity data. Each signal datum was normalized by creating ratios to the control samples. For 4 clones out of 14 clones, Northern analyses could not detect the signals.

## Acknowledgements

I would like to thank M. Araki and K. Takahashi for their assistance with the microarray experiments, and Drs. S. Youssefian and H. Wabiko for their comments on the manuscript. Some of the data used were from a study that was supported by an MAFF Rice Genome Project grant, MA-2109. A data normalization service based on this method is commercially available at http://www.skylight-biotech.com.

## References

1. Chen Y, Dougherty ER, Bittner ML: **Ratio-based decisions and the quantitative analysis of cDNA microarray images.** *J Biomed Optics* 1997, **2:**364–374. doi:10.1117/1.429838
2. Tseng GC, Oh MK, Rohlin L, Liao JC, Wong WH: **Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects.** *Nucleic Acids Res* 2001, **29:**2549–2557. doi:10.1093/nar/29.12.2549
3. Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP: **Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation.** *Nucleic Acids Res* 2002, **30:**e15. doi:10.1093/nar/30.4.e15
4. Workman C, Jensen L, Jarmer H, Berka R, Gautier L, Nielser H, Saxild HH, Nielsen C, Brunak S, Knudsen S: **A new non-linear normalization method for reducing variability in DNA microarray experiments.** *Genome Biol* 2002, **3:**RESEARCH0048. doi:10.1186/gb-2002-3-9-research0048
5. Huber W, Von Heydebreck A, Sultmann H, Poustka A, Vingron M: **Variance stabilization applied to microarray data calibration and to the quantification of differential expression.** *Bioinformatics* 2002, **18:**S96–104.
6. Durbin BP, Hardin JS, Hawkins DM, Rocke DM: **A variance-stabilizing transformation for gene-expression microarray data.** *Bioinformatics* 2002, **18:**S105–110.
7. Quackenbush J: **Microarray data normalization and transformation.** *Nat Genet* 2002, **32**(Suppl):496–501. doi:10.1038/ng1032
8. Rocke DM, Durbin B: **A model for measurement error for gene expression arrays.** *J Comput Biol* 2001, **8:**557–569. doi:10.1089/106652701753307485
9. Kerr MK, Martin M, Churchill GA: **Analysis of variance for gene expression microarray data.** *J Comput Biol* 2000, **7:**819–837. doi:10.1089/10665270050514954
10. Quackenbush J: **Computational analysis of microarray data.** *Nat Rev Genet* 2001, **2:**418–427. doi:10.1038/35076576
11. Sherlock G: **Analysis of large-scale gene expression data.** *Brief Bioinform* 2001, **2:**350–362.
12. Bilban M, Buehler LK, Head S, Desoye G, Quaranta V: **Normalizing DNA microarray data.** *Curr Issues Mol Biol* 2002, **4:**57–64.
13. Yang YH, Buckley MJ, Speed TP: **Analysis of cDNA microarray images.** *Brief Bioinform* 2001, **2:**341–349.
14. Olshen AB, Jain AN: **Deriving quantitative conclusions from microarray expression data.** *Bioinformatics* 2002, **18:**961–970. doi:10.1093/bioinformatics/18.7.961
15. Konishi T: **Parametric treatment of cDNA microarray data.** *Genome Informatics* 2002, **13:**280–281. [http://www.jsbi.org/journal/GIW02/GIW02P166.pdf]
16. Hoyle DC, Rattray M, Jupp R, Brass A: **Making sense of microarray data distributions.** *Bioinformatics* 2002, **18:**576–584. doi:10.1093/bioinformatics/18.4.576
17. Eisen MB, Brown PO: **DNA arrays for analysis of gene expression.** *Methods in Enzymology* 1999, **303:**179–205.
18. Kepler TB, Crosby L, Morgan KT: **Normalization and analysis of DNA microarray data by self-consistency and local regression.** *Genome Biol* 2002, **3:**RESEARCH0037. doi:10.1186/gb-2002-3-7-research0037
19. Yue H, Eastman PS, Wang BB, Minor J, Doctolero MH, Nuttall RL, Stack R, Becker JW, Montgomery JR, Vainer M, Johnston R: **An evaluation of the performance of cDNA microarrays for detecting changes in global mRNA expression.** *Nucleic Acids Res* 2001, **29:**e41. doi:10.1093/nar/29.8.e41
20. Gollub J, Ball CA, Binkley G, Demeter J, Finkelstein DB, Hebert JM, Hernandez-Boussard T, Jin H, Kaloper M, Matese JC, Schroeder M, Brown PO, Botstein D, Sherlock G: **The Stanford Microarray Database: data access and quality assessment tools.** *Nucleic Acids Res* 2003, **31:**94–96. doi:10.1093/nar/gkg078
21. Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JCF, Trent JM, Staudt LM, Hudson J Jr, Boguski MS, Lashkari D, Shalon D, Botstein D, Brown PO: **The transcriptional program in the response of human fibroblasts to serum.** *Science* 1999, **283:**83–87. doi:10.1126/science.283.5398.83
22. Ichihara K: *Statistics for Bioscience – practical technique and theory.* Tokyo: Nankodo; 1990.
23. Churchill GA: **Fundamentals of experimental design for cDNA microarrays.** *Nat Genet* 2002, **32**(Suppl):490–495. doi:10.1038/ng1031
24. Horvath DP, Schaffer R, West M, Wisman E: **Arabidopsis microarrays identify conserved and differentially expressed genes involved in shoot growth and development from distantly related plant species.** *Plant J* 2003, **34:**125–134.
25. Zien A, Aigner T, Zimmer R, Lengauer T: **Centralization: a new method for the normalization of gene expression data.** *Bioinformatics* 2001, **17:**S323–331.
26. **NIST/SEMATECH e-Handbook of Statistical Methods** [http://www.itl.nist.gov/div898/handbook/]
27. Schuchhardt J, Beule D, Malik A, Wolski E, Eickhoff H, Lehrach H, Herzel H: **Normalization strategies for cDNA microarrays.** *Nucleic Acids Res* 2000, **28:**e47. doi:10.1093/nar/28.10.e47

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.