Intensity dependent estimation of noise in microarrays improves detection of differentially expressed genes
© Zeisel et al; licensee BioMed Central Ltd. 2010
Received: 4 April 2010
Accepted: 27 July 2010
Published: 27 July 2010
In many microarray experiments, analysis is severely hindered by a major difficulty: the small number of samples for which expression data has been measured. When one searches for differentially expressed genes, the small number of samples gives rise to an inaccurate estimation of the experimental noise. This, in turn, leads to loss of statistical power.
We show that the measurement noise of genes with similar expression levels (intensity) is identically and independently distributed, and that this (intensity dependent) distribution is approximately normal. Our method can be easily adapted and used to test whether these statement hold for data from any particular microarray experiment. We propose a method that provides an accurate estimation of the intensity-dependent variance of the noise distribution, and demonstrate that using this estimation we can detect differential expression with much better statistical power than that of standard t-test, and can compare the noise levels of different experiments and platforms.
When the number of samples is small, the simple method we propose improves significantly the statistical power in identifying differentially expressed genes.
DNA microarrays are widely used tools for simultaneous measurement of the expression of thousands of genes. Applications of microarrays include (1) identifying differentially expressed genes between two groups, (2) monitoring typical temporal or spatial profiles of genes and (3) classifying samples on the basis of their gene expression signature. The technical procedure of a typical experiment contains the following steps: (1) mRNA is extracted from the sample cells, (2) mRNA is converted to cDNA, (3) cDNA is amplified and labeled, (4) the labeled cDNA is hybridized to a glass slide containing complementary probes, (5) the slide is scanned by a laser, (6) the image is analyzed using a signal processing algorithm which provides the intensity levels and some quality control information. Two types of microarrays are in common use: two color - in which hybridization is performed on a mixture of (differently labeled) cDNA obtained from two samples; and single color (also known as oligonucleotide chip) - in which each sample is hybridized to a different chip. In this work we focus on single color oligonucleotide microarrays.
Similarly to every other measurement technique, microarray measurements include noise. We define noise in a model-independent way: repeating an experiment many times under identical conditions generates a distribution of the measured quantity (e.g. the intensity of a single gene, measured in the same sample). The fluctuating random variable that gives rise to this distribution is "noise", quantified by the standard deviation of this distribution. Characterizing the noise distribution is important for assessing the statistical significance of observed differential expression.
Differentially expressed genes
When biologists compare the expression levels of a gene between two conditions (A and B), say by real time PCR, they repeat the measurements a few times in each condition. Using these repeats, they can estimate basic properties of the noise distribution - usually the mean and the standard deviation - either directly (as proposed here) or indirectly (as done when a t-test is performed). Without any estimate of the noise it is not possible to assign statistical significance to a discovery of differential expression (i.e. calculate a p-value - the probability to get the measured or larger difference of expression from the random fluctuations of the measurement). Practically implementing this approach when using microarrays is difficult because of two main problems: (1) The high cost of each microarray requires careful design of the number and type of repeats in each condition. An insufficient number of repeats can reduce the statistical power of the experiment, thus lowering the sensitivity to detect differentially expressed genes, while using a large number of repeats is very costly. (2) The high throughput nature of this system enables measurement of thousands of genes simultaneously, while in a typical microarray experiment the desired result is a small subset of genes. This introduces a need for control of false positives (type I errors). In contrast to a single gene experiment, in which classically a p-value of 0.05 suffices, using such a p-value for a microarray (of say 10000 genes) will return ~0.05 × 10000 = 500 genes which result from random noise, masking the true differentially expressed genes we are seeking. To overcome this problem, several methods for multiple testing (e.g. Bonferroni, FDR ) can be used. However, they usually pass only genes with a much lower "naive" p-value, and therefore their use calls for applying more sensitive methods to estimate single gene p-values.
The aim of this work
We present a method that improves the statistical power of testing for differentially expressed genes in experiments with small numbers of repeats (or even no repeats at all). We achieve this by showing that the main factor that governs noise is the intensity (expression level) of a gene. Using this, we estimate the measurement noise on the basis of averaging a rough (single gene based) estimate over a large number of genes with similar intensity (and hence similar noise distribution).
The issues of normalization and statistical analysis of microarray results were the subject of many studies  motivated by the need to reduce the likelihood of reporting false, noise-generated interpretations of the biological systems at hand. In this work we do not deal with the normalization and low-level processing steps and assume that the data are normalized.
Several papers [3–5] identified intensity as a major factor governing microarray induced noise. Novak et al.  studied the reproducibility of replicate microarray experiments by comparing the results of parallel assays done with mRNA probes synthesized from the same mRNA sample. They suggest a linear dependency between the replicates' mean intensity and replicates' standard deviation. Tu et al.  take this idea forward, using a systematic experimental design, which enables them to compare samples taken from the same cell line but from a different dish. They try to discriminate between biological and technical noise. They also fit an exponential function to the standard deviation line, and find that the hybridization noise has the greatest contribution to the total measurement noise. As a practical result from these noise characterizations, they propose a procedure for calculating a p-value for each gene based on its mean expression level and its fold change. Another attempt to address this issue was made by Huber et al. , who presented a variance stabilization approach. They tried to reduce the dependency between the variance and the mean intensity in order to get a uniform (intensity independent ) noise distribution. They were able to do so by a two step transformation of the data: a linear transformation followed by a sinh-1 transformation. After applying this transformation a simple fold change cut-off is equivalent to a p-value cut-off.
A widely used method for detection of differentially expressed genes, which takes into account the intensity-dependence of the noise is SAM (Significance Analysis of Microarrays) . In SAM, an intensity corrected statistic d is calculated for each gene, and significant genes are then identified using comparison to random permutations of the groups. The intensity-dependent correction is obtained by incorporating a constant additive "fudge factor" s0 in the denominator of the standard t-statistic. The exact value of s0 is selected to minimize the dependence of the d statistic on the expression level. Thus, the fudge factor serves to reduce the significance of the noisier low-intensity genes (a detailed comparison with SAM is shown later in the paper).
Another approach provided by Nykter et al.  was to simulate the whole process of microarray measurements starting from the very low level of spot image analysis. This approach is useful in order to simulate large scale microarray data under realistic conditions in order to test and validate data analysis algorithms. A major comprehensive effort to asses the inter- and intra-platform reproducibility of microarrays was done recently by the Microarray Quality Control (MAQC) project . The MAQC study reports intraplatform consistency and high level of interplatform concordance. The goal of this study was to asses reproducibility and not to characterize the noise; this was done in a subsequent study by Klebanov et al. , who used the MAQC dataset to quantify the level of noise in Affymetrix microarrays. They suggest that the average level of noise in technical replicates (without biological variability) is quite low, as exemplified by the lack of bias induced by such noise in pairwise correlation coefficient estimation.
A generally accepted way to model microarray noise is as a combination of (intensity independent) additive and multiplicative components. We claim that such a parametrization does not provide a good description of the noise and its complexity. In Additional file 1: Section 1, Figure S1, we provide evidence supporting this claim. In what follows we show that our intensity-dependent parametrization does have enough flexibility to provide an accurate description of the noise distribution.
Throughout this work we use the following notations:
I i,j - is the measured log2 intensity of a gene i on microarray j.
I0,i, j- is the true underlying (without the microarray induced noise) log2 intensity level of gene i on microarray j.
U - are random variables used in modeling the microarray induced noise.
Throughout the paper we assume that the measured (log) intensity can be written as I = I0 + U(I0), where U is the noise component of the signal which we aim to characterize (note: this representation is general enough to allow for multiplicative noise, as discussed below).
In the Results section we show in a more formal way that for Affymetrix expression data the noise distribution is indeed mostly intensity dependent, and thus satisfies our assumptions. The same method can be used for any other type of data to test validity of these assumptions. The main point of this paper is to actually estimate this i.i.d. (independent and identical distributed) intensity dependent noise. Our proposed approach, which is based on a local noise estimation, is presented in the Results section, which contains also several practical applications to the analysis of microarray experiments. We then demonstrate the advantage of using our approach by analysis of simulated data and of expression data from several experiments. We close with a discussion of the issue of experimental design and with some concluding remarks.
The noise is mostly intensity dependent; formulation and validation
In general we can write the measured signal as I = I0 + U(I0), where U is the noise term, that according to the above figures has an intensity dependent distribution. Note that by expressing the measured signal as I = I0 + U(I0) we are not limiting our noise model to additive noise only; the standard parametrization of the form I = I0f + a can be represented by U(I0) = I0(f - 1) + a. In fact, by allowing a general intensity-dependent parametrization we allow an even more flexible description than a combination of additive and multiplicative noise terms (see Additional file 1 for more details and fit to data). We show below that the distribution of U(I0) is normal with mean zero and intensity dependent variance. In addition, we claim that the noise terms for different genes are independent random variables, and therefore the noise terms of genes with similar intensities are independent identically distributed random variables.
The noise term is normally distributed
Figures 3b and 3c present similar plots (pdf comparison and KS score on the left and Q-Q plot on the right), but here we use genes that belong to specific bins of mean intensity; Figure 3b is for 5.75 ≤ I ≤ 6.25 and 3c is for 8.75 ≤ I ≤ 9.25. Both show good fit to the expected normal distribution (p KS = 0.87 and p KS = 0.15 in Kolmogorov-Smirnov test, respectively). The good fit to normal distributions shown in Figures 3b and 3c suggests that the noise term is i.i.d and indeed approximately normally distributed for genes in the same intensity bin. In Additional file 1: Supplemental Figures S2-S6 we provide similar plots for a wide range of intensities and intensity bin sizes, which further support our claim that the normal distribution is a good approximation, and that this holds when the choice of bin size is varied. A second observation is that the variance of the normal distribution changes as we move from one intensity bin to another. This is exactly the reason for the poor fit in Figure 3a, since the distribution of mixed normal random variables with different variances is not normal. These figures also suggest that the noise terms for different genes are independent (note that we claim that the noise terms are independent and not that the gene expression levels are independent). Additionally, the mean of all the experimental pdfs is zero, as expected from data after a normalization procedure which removes biases.
The noise variance estimator has a χ2distribution
The fit of the distribution of to the χ2 distribution is good. Nevertheless, it should be noted that based only on two repeats, results such as shown in Figures 4a and 4b do not suffice to prove that the noise distribution is i.i.d. The reason is that one can not rule out the possibility of a hypothetical scenario, in which the distribution of variances of the genes is chi-square but in a gene dependent manner (not only intensity dependent and not i.i.d). To rule out this (unlikely) scenario, note that under such a scenario the distribution of should be independent of the number of repeats, whereas (under our assumptions) we predict dependency of the χ2 distribution on the number of repeats. Note that if our predictions are right, in the limit n → ∞ the distribution goes to a delta function at σ2. We see good fits in Figures 4c and 4d to narrower χ2 distributions than those of Figures 4a and 4b. In Figures 4e and 4f we show the dependence of the variance estimator distribution on the number of repeats. As expected, the mean is almost independent of the number of repeats (see Figure 4e), whereas in agreement with eq.(3) the variance decreases (see Figure 4f) as the number of repeats increases. The fact that the distribution of the variance estimator becomes narrower for increasing number of repeats at the same rate as expected by the the chi-square distribution, is consistent with normal i.i.d. distributions of the noise for different genes with similar intensity.
Estimation of the intensity dependent variance
The MAQC project dataset
A naive approach for estimating the noise is to calculate the variance estimator for each gene, one at a time, using the n repeats in hand (for example, as is implicitly done in a t-test or by using SAM). However, in the case of a small sample size large fluctuations in the variance estimation (used while calculating the t statistic) lead to reduced statistical power.
Our proposed method is based on eq. (3); specifically, on the statement that the mean of the variance estimator is the true variance. We use the fact that the noise is intensity dependent, and hence by calculating the mean variance estimator of a (relatively) large number of genes with similar intensity, we get an estimate of the true variance which is significantly improved and stabilized versus the one based on a single gene. Using these intensity dependent standard deviation values, we can assign statistical significance to the differential expression of a gene between two conditions using the more powerful z-test (instead of the t-test or SAM), thus gaining improved statistical power. We demonstrate this improvement on simulated data and on real experimental data.
Analyzing data with n ≥ 2 repeats
For each gene calculate the statistic .
- 3.For each gene test the null hypothesis that there is no difference between the expression of the two conditions. Formally stated, the null hypothesis is:(4)
The above suggested p-value calculation is equivalent to using the z-test which provides higher statistical power than the corresponding t-test.
Performance on simulated data
Performance on an experimental dataset
Figures 8c-8j summarize the number of candidate genes identified by each method (SAM, 2fold, our method) and their overlaps between all consecutive time points using a FDR of 5% (Benjamini Hochberg procedure ). As can be seen, utilizing the intensity-dependent standard deviation yields a highly improved statistical power compared to the two alternative methods (SAM and 2-fold change cutoff) as well as the standard t-test (which identifies differentially expressed genes only in the 2-4 hours comparison). An obvious question that arises is the fraction of true and false positives among the large number of differentiating genes found by our method. An indirect way to test this is to consider whether these differentiating genes make "biological sense". To this end, a gene ontology term enrichment analysis was performed on the differentially expressed genes identified by the z-test method using the DAVID web tool . For all consecutive time points tested, a large and significant enrichment is observed in immune response related genes, such as: cytokine production, cell death, phagocytosis and the mitochondrial electron transport chain (Additional file 1: Supplemental Tables S1-S3). Enrichment for immune system related terms is also observed when testing only genes uniquely identified by our method and not by the alternative SAM or 2-fold methods (Additional file 1: Supplemental Tables S4-S6), while we did not observe such enrichment when testing the lists of genes that were identified uniquely either by SAM or by 2-fold change (Additional file 1: Supplemental Tables S7-S8).
A more direct way to test whether the genes discovered by our method and not by SAM are true positives is by an independent experimental validation. We repeated the experiment of Amit et al  and tested, by qPCR measurements, the expression profiles of 15 selected genes that were found to be differentially expressed by our method (five of which were not found by SAM or t-test), and compared their time-dependent profiles to those obtained by Amit et al using microarrays. Indeed, although the experiment was done on different murine bone marrow dendritic cells by different people at different labs at a different time, the qPCR profiles of the genes looks very close to those derived from the the array-based profiles (Additional file 1: Supplemental Figure S10). In particular, note that we validated the variation of those five genes that were identified by our method as significantly varying and not identified as such by either SAM or t-test.
Analyzing data without repeats
We now turn to the case of a single microarray in each biological condition, without repeats. The only way one can estimate noise in such a case is by viewing the two different conditions as two repeats. Of course, one has to address the question: is it possible to distinguish between differences due to noise, and the true condition dependent changes in gene expression, even when there are no real repeats? Clearly, this scenario is not recommended and one should try to avoid it. Since the technical noise depends mainly on the mean expression level while the real biological variation can be gene specific, naive averaging over the variance estimator ( ) in an intensity bin will produce over-estimation of the variance, because it treats real biological differences also as noise. Under the assumption that the number of genes whose expression levels significantly differ between the two conditions is small, the Robust Variance Estimation procedure described in the Methods section can partially overcome this problem. In this case, we suggest as a heuristic to find for each sample the one or two closest samples (by PCA or any other distance measure), treat these as repeats and apply the Robust Variance Estimation procedure.
Experimental design considerations
where denotes the mean, denotes the variance of the gene, and n denotes the number of repeats in conditions A, B. Taking n A = n B ≡ n for genes with average expression of 4 and 8 (σ2 ≈ 0.2, 0.04 respectively), we can predict the minimal fold changes needed to achieve a 3-sigma threshold, for different numbers of repeats.
Fold change values needed to achieve p = 0.0026 (3σ deviation), as a function of the number of repeats and the intensity.
n = 1
n = 2
n = 3
I = 4 (σ ≈ 0.45)
I = 8 (σ ≈ 0.2)
Discussion and Conclusions
When using DNA microarrays we are faced with the problem of a small sample size combined with a large number of genes, out of which we want to identify a small differentially expressed subset. Therefore, special care must be taken to correctly estimate the distribution of noise for all the genes. Even a small fraction of genes for which we under-estimate the variance can significantly contaminate the results with false positives. We showed here a method for estimation of the technical noise characteristics of the microarrays. Utilizing the fact that the technical noise is intensity dependent, we are able to use hundreds of genes with similar intensity to get an accurate estimation of the noise profile. Using the resulting estimated variance, we propose a reliable and sensitive method to detect significant changes of expression associated with different biological conditions.
It has been previously shown  that the noise level in microarray experiments is low. While being encouraging, this does not mean that this noise can just be ignored. Using a small number of repeats often leads to a noisy estimation of the variance, thus leading to a reduced statistical power when using t-test statistics on each gene separately. This can often lead to non-significant statistics even on genes with a biologically relevant change in expression. On the other hand, as evident from the current study, using the average variance across all genes can lead to false-positives, due to the non-uniformity of variance distribution across genes. By showing that the noise term has approximately normal i.i.d. distribution for genes with similar average intensity, and calculating the intensity-dependent noise profile, one can utilize the low noise levels to accurately identify differentially expressed genes even when the number of samples is small.
Bin size selection
There is no clear criterion to determine the bin size. The selected bin size is the result of a trade-off; on the one hand we wish to have in each bin a large number of points (genes) N, in order to improve the statistics (the variance is estimated on the basis of N variances measured from the data), while on the other hand one wishes to keep the intensity range that defines the bin small (since genes of similar intensity are hypothesized to have the same variance). In such a case one should examine the sensitivity of the results to the bin size used. We find that having N ≥ 200 elements in each bin provides a good estimate for the parameters of the desired distribution. In Figure 3 we use a bin width of 0.5 (in log2 of expression values), which is not optimal, but as can be seen, does provide a distribution very close to normal.
Empirical probability density function estimation
The experimental probability density functions f(x) were evaluated by dividing the range of the variable x that characterizes the data points (e.g. difference between repeats) to about 30 - 50 equal bins. The number chosen for each case depended on how many data points were available; on the one hand, we want to have small fluctuations of the number of points in a bin, and on the other hand, we wish to have a sufficient number of bins to provide information about the probability density. The counts in each bin were normalized so that the integral of the resulting histogram is 1, and a simple smoothing procedure (running average over 5 elements) was applied.
Robust variance estimation
Calculate the naive mean and variance estimators for each gene as described in eq. (1).
Define the neighborhood of gene i as all genes k satisfying , where w is a chosen bin size (can be fixed or varied).
Calculate the intensity dependent variance estimator (IDVE):
- 4.Calculate the two tailed p-value for each gene variance estimator ( ) using the IDVE , as follows:(6)
- 5.Truncate the approximately 1% extreme values of the distribution. This should be done carefully in a way that does not bias the mean for asymmetric distributions (such as chi-square): Define the right cutoff point (outlayers with high variance) by selecting the largest point with p > 0.01 (denote it as x 2). To get x 1, the corresponding cutoff value on the left, we solve the integral equation:(7)
Repeat the process (steps 2 - 5) with the truncated distribution until it converges (i.e. the IDVE does not change).
The data used for this analysis is available on-line from the GEO database. In the first dataset, accession number GSE19921, the RNA measurements (used here as 8 replicates) were done on the breast cancer cell line T47 D, with duplicates taken under the following conditions: wild-type T74 D, shEGFP T47 D and two shRNAs designed to target ERBB3. The 8 microarrays were viewed as replicates since the RNA measurements showed only very minor effects of introducing the shRNA. The second dataset is the MAQC , accession number GSE5350, where we used only the Affymetrix arrays from Site 1. The third dataset (accession number GSE17721) from Amit et al  is expression data of mouse dendritic cells after LPS stimulation.
Data from the Affymetrix 3' arrays used, HG-U133 Plus 2.0 in  and Affymetrix HT-MG-430A in , were preprocessed using the Affymetrix MAS5 algorithm and then corrected by a Lowess multi array procedure. The Affymetrix HuGene 1.0 ST array data were preprocessed using the Affymetrix PLIER algorithm (with no normalization) and then corrected by a Lowess multi array procedure.
independent and identical distributed
intensity dependent variance estimator
probability density function
cumulative distribution function.
We thank Y. Yarden and E. Bublil for providing the data that was used for the analysis, N. Barkai for most useful discussions, and S. Jung and R. Krauthgamer for deriving the murine bone marrow dendritic cells. The work of AZ and ED was supported by the Leir Charitable Foundation, by a Weizmann-Mario Negri collaborative research grant and by a grant from the German-Israeli Project Cooperation (DIP-DFG).
- Benjamini Y, Hochberg Y: Controlling the False Discovery Rate - a Practical and Powerful Approach to Multiple Testing. J R Statist Soc B 1995, 57: 289–300.Google Scholar
- Quackenbush J: Microarray data Normalization and Transformation. Nat Genet 2002, (32 Suppl):496–501. 10.1038/ng1032Google Scholar
- Novak JP, Sladek R, Hudson TJ: Characterization of Variability in Large-scale gene Expression Data: Implications for Study Design. Genomics 2002, 79: 104–113. 10.1006/geno.2001.6675View ArticlePubMedGoogle Scholar
- Tu Y, Stolovitzky G, Klein U: Quantitative Noise Analysis for gene Expression Microarray Experiments. Proc Natl Acad Sci USA 2002, 99: 14031–14036. 10.1073/pnas.222164199View ArticlePubMedPubMed CentralGoogle Scholar
- Huber W, von Heydebreck A, Sültmann H, Poustka A, Vingron M: Variance Stabilization Applied to Microarray data Calibration and to the Quantification of Differential Expression. Bioinformatics 2002, 18(Suppl 1):S96–104.View ArticlePubMedGoogle Scholar
- Tusher VG, Tibshirani R, Chu G: Significance Analysis of Microarrays Applied to the Ionizing Radiation Response. Proc Natl Acad Sci USA 2001, 98: 5116–5121. 10.1073/pnas.091062498View ArticlePubMedPubMed CentralGoogle Scholar
- Nykter M, Aho T, Ahdesmäki M, Ruusuvuori P, Lehmussola A, Yli-Harja O: Simulation of Microarray data with Realistic Characteristics. BMC Bioinformatics 2006, 7: 349–349. 10.1186/1471-2105-7-349View ArticlePubMedPubMed CentralGoogle Scholar
- MAQC Consortium: The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature Biotechnology 2006, 24(9):1151–1161. 10.1038/nbt1239View ArticlePubMed CentralGoogle Scholar
- Klebanov L, Yakovlev A: How high is the Level of Technical Noise in Microarray Data? Biol Direct 2007, 2: 9–9. 10.1186/1745-6150-2-9View ArticlePubMedPubMed CentralGoogle Scholar
- Naef F, Hacker CR, Patil N, Magnasco M: Empirical Characterization of the Expression Ratio Noise Structure in High-density Oligonucleotide Arrays. Genome Biol 2002, 3: RESEARCH0018-RESEARCH0018. 10.1186/gb-2002-3-4-research0018View ArticlePubMedPubMed CentralGoogle Scholar
- Giles PJ, Kipling D: Normality of Oligonucleotide Microarray data and Implications for Parametric Statistical Analyses. Bioinformatics 2003, 19: 2254–2262. 10.1093/bioinformatics/btg311View ArticlePubMedGoogle Scholar
- Konu O, Li MD: Correlations Between mRNA Expression Levels and GC Contents of Coding and Untranslated Regions of Genes in Rodents. J Mol Evol 2002, 54: 35–41. 10.1007/s00239-001-0015-zView ArticlePubMedGoogle Scholar
- Amit I, Garber M, Chevrier N, Leite AP, Donner Y, Eisenhaure T, Guttman M, Grenier JK, Li W, Zuk O, Schubert LA, Birditt B, Shay T, Goren A, Zhang X, Smith Z, Deering R, McDonald RC, Cabili M, Bernstein BE, Rinn JL, Meissner A, Root DE, Hacohen N, Regev A: Unbiased Reconstruction of a Mammalian Transcriptional Network Mediating Pathogen Responses. Science 2009, 326: 257–263. 10.1126/science.1179050View ArticlePubMedPubMed CentralGoogle Scholar
- Huang DW, Sherman BT, Lempicki RA: Systematic and Integrative Analysis of Large gene Lists Using DAVID Bioinformatics Resources. Nat Protoc 2009, 4: 44–57. 10.1038/nprot.2008.211View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.