- Research article
- Open Access
Validation and characterization of DNA microarray gene expression data distribution and associated moments
BMC Bioinformatics volume 11, Article number: 576 (2010)
The data from DNA microarrays are increasingly being used in order to understand effects of different conditions, exposures or diseases on the modulation of the expression of various genes in a biological system. This knowledge is then further used in order to generate molecular mechanistic hypotheses for an organism when it is exposed to different conditions. Several different methods have been proposed to analyze these data under different distributional assumptions on gene expression. However, the empirical validation of these assumptions is lacking.
Best fit hypotheses tests, moment-ratio diagrams and relationships between the different moments of the distribution of the gene expression was used to characterize the observed distributions. The data are obtained from the publicly available gene expression database, Gene Expression Omnibus (GEO) to characterize the empirical distributions of gene expressions obtained under varying experimental situations each of which providing relatively large number of samples for hypothesis testing. All data were obtained from either of two microarray platforms - the commercial Affymetrix mouse 430.2 platform and a non-commercial Rosetta/Merck one. The data from each platform were preprocessed in the same manner.
The null hypotheses for goodness of fit for all considered univariate theoretical probability distributions (including the Normal distribution) are rejected for more than 50% of probe sets on the Affymetrix microarray platform at a 95% confidence level, suggesting that under the tested conditions a priori assumption of any of these distributions across all probe sets is not valid. The pattern of null hypotheses rejection was different for the data from Rosetta/Merck platform with only around 20% of the probe sets failing the logistic distribution goodness-of-fit test. We find that there are statistically significant (at 95% confidence level based on the F-test for the fitted linear model) relationships between the mean and the logarithm of the coefficient of variation of the distributions of the logarithm of gene expressions. An additional novel statistically significant quadratic relationship between the skewness and kurtosis is identified. Data from both microarray platforms fail to identify with any one of the chosen theoretical probability distributions from an analysis of the l-moment ratio diagram.
The current biological literature makes extensive use of gene mRNA expression data from experimental systems called gene chips/gene micro-arrays. These data are used to infer genomic level conclusions. For example, to infer the response of an organism or cell culture under treatment or perturbation. Microarrays as an experimental system are very valuable in that they provide a genome-wide picture (for all the (~30000) genes). Unfortunately, because of costs of collecting microarray data, the number of samples per treatment is quite small (~2-10).
The data from microarrays are noisy. There are a number of reasons to expect variability in the measurements of the expressions of the genes in mammalian organisms. These include biological causes, or the noise associated with the steps involved in the measurement of the gene expression. Depending on the question that the researcher is trying to answer, he/she would have to control for many of these sources of variation of gene expression. This paper is interested in understanding the variation in the expression data after the known/reported sources of variation have been controlled for.
The biological variability could be due to genetic or non-genetic factors  studies the cis-acting variation that explains differences of 15 genes in the human brain  reviews literature to suggest that the allele-specific differences in the rates of transcription are common. Other studies demonstrating the influence of genotype on gene expression include  (in the human thyroid tissue), [4, 5] (in human lymphoblastoid cell lines),  (human blood leukocytes),  (in human liver tissue) and  (in liver tissue from mice of different strains).
Among the non-genetic factors explaining the variation in gene expression include the gender of the organism -  (in human skeletal muscle),  (in human white blood cells),  (in human peripheral blood mononuclear cells),  (in human retina),  (in human blood) and  (in the liver and kidney of mice and rats). Age of the organism was shown to be a significant covariate in  (in human retina),  (in human peripheral blood mononuclear cells),  (in human skeletal muscle) and  (in human blood). There is a significant variability of gene expression across different tissue or cell types [15, 16]. There is even variation within a given tissue due to the presence of multiple cell types - [11, 13] (in human blood),  (in human skeletal muscle) and  (in human placenta). Other important factors include the diet and fasting status of the organism  and the time of day that the samples were taken . Another important variability factor is the environmental condition that the organism was under before the tissue sample was taken - for example the mice could be sleep deprived , undergoing craniofacial development , given oral doses of synthetic triglycerides  or fed doses of chemicals that are known to be lung carcinogens  or it could be medical students under psychological stress before a major exam .
The other classes of non-genetic gene expression variability that has been studied intensively (theoretically and experimentally) have been termed intrinsic, extrinsic and pathway-specific or global noises (see [23, 24] for reviews on this). Intrinsic noise is assigned to variation arising because of the stochastic nature of transcription and translation due to the small number of mRNA and protein molecules. Extrinsic noise could be due to changes in the cellular environment. These noises can be demonstrated experimentally by observing the output of two different reporters for the same gene in the same cell and separate cells. Pathway-specific noise can be viewed as the noise that is transferred along all the genes whose genes sequentially participate in given cellular or biochemical process.
Variability could arise at various stages involved in getting the output from a microarray and also after data from the microarray are obtained. This is in terms of intensity measurements (for the predefined surrogates for different genes that are termed probe sets) being normalized and preprocessed to get estimates of gene expressions. RNA are isolated from the cells obtained from the tissue sample has been drawn from the organism. The RNA are then subjected to the process of reverse transcription (RT) to obtain cDNA that are then subjected to the vitro transcription (IVT) process to obtain cRNA using polymerases. The cRNA are then hybridized to probes on the microarray platform . The factor that influence the final intensity measurements for different probes include the amount of polymerase used for the IVT process, the amount of time allotted for this process by the experimenter and the binding specificity of the cRNA to the corresponding probe sequence on the microarray platform . Tu et al  performed a small controlled analysis of the noise characteristics in the gene expression data from microarrays. They provide empirical distributions of gene expressions arising just because of the variation introduced by the above described process of getting to intensity values from the microarray, i.e., they performed an analysis of the measurement variability. It does not seem trivial to control for a consistent microarray experiment protocol -  demonstrated that the laboratory where the microarray experiment was performed is a major source of variation  found that microarray experiments performed in different years had different characteristics of gene expression. Different microarray data preprocessing and normalization algorithms generate expression data with different characteristics . In addition, microarrays as tools for detecting changes in gene expression have been shown to be sensitive and specific in a well defined range of its output or gene expression. Low gene expression can be confused with optical noise while due to saturation, changes at higher levels of gene expression are more difficult to detect .
Most of the current journals require the microarray data to be deposited on a database (like the Gene Expression Omnibus, GEO ) if these data were used for analysis in a paper. The data deposited on the database are in the form of multiple samples corresponding to each of multiple conditions (typically one of these conditions would correspond to the normal situation when there is no disease or toxin dosed). The data are deposited depending on the type of microarray chip used. There are different commercial manufactures of microarrays - Affymetrix and Agilent being the most popular. Each of these manufacturers makes chips for specific organisms (mouse, rat, frog, human zebrafish etc.) and within chips for specific organisms also there are different varieties.
One of the main areas where microarray data has had its application is in the identification of differentially expressed genes across varying treatment conditions. The approaches used could be classified based on whether they use parametric assumptions about the underlying distribution the gene expression or not. Kerr et al  used a ANOVA model to capture variation of gene expression arising from two color microarrays. The non-normality of the residuals of this model was noted in this work. Newton et al  assume that all gene expressions are gamma distributed - a distribution that is right skewed and whose coefficient of variation decreases with increasing mean. They attempted to validate this using real data and found that the fits did capture the primary features of variation but were poor in general. Rocke and Durbin  assume a additive-multiplicative error model for gene expression that is additive at low levels of gene expression and multiplicative at higher levels of expression. So by this one would expect the gene expression to be log-normally distributed at high levels of gene expression. Then there are a class of approaches called empirical Bayes methods - [34–36] are examples - that make prior assumptions about the distribution of gene expression and then use the data itself to get the posterior probability of the gene being differentially expressed. They are all based on the observation that mean and the coefficient of variation of the distribution of gene expression have a definite pattern of variation (eg. see ). The typical microarray set up uses a relatively small number of samples to generate data for tens of thousands of genes. The empirical Bayes approaches essentially attempt to improve the estimation of the variance of a given gene by utilizing the observed pattern of variation between mean and the coefficient of variation of expression of all the target genes in the system [34, 35] assume normal priors of gene expression  assumes a normal prior for the difference in means of the gene expression over the two conditions. Non-parametric methods have also been proposed to be used for the analysis of differential expression (eg. see ). Popular permutation-based method based on a modified t statistic is the so called SAM statistic and its modifications [39, 40]. Bayesian network inference represents another analysis that makes use of gene expression data and typically assumes that the data is normally distributed -  and . An additional analysis where gene expression has been significantly used is in the development of a prediction model or a classifier - for example in the development of a classifier between two different forms of leukemia , prediction of prognosis of patients with breast cancer  and the prediction of potential carcinogenicity of a chemical . Such analyses do not make distributional assumptions on the expression of the gene expression data and the efficacy of the models developed in the above examples was done using cross-validation of data.
The above paragraph describes a snapshot of the analysis done using gene expression data, some of the analyses make use of distributional assumptions and some do not. Since distributional assumptions are made frequently, it appears prudent to validate this assumption. As mentioned above Newton et al  and Kerr et al  attempted to verify their distributional assumptions and did not find their assumptions adequately supported by the data. Tu et al  did not consider the analysis of variation due to biological variability. The literature thus lacks an empirical validation of the distribution of gene expression as measured by microarrays across multiple univariate theoretical distributions with sufficient amount of data.
Since 2002 a significant amount of data from sources like the Gene Expression Omnibus GEO  and ArrayExpress ( has become available. This data can now potentially be used to validate the distribution and noise assumptions for statistical analysis and to develop improved inference methods to analyze the microarray data.
This paper focuses on identifying and validating empirical distribution fits of genome-wide gene expressions as measured by microarrays. In addition to the normal distribution we empirically tested the empirical fit for a number of well established probability distributions.
We analyzed four microarray data sets from the GEO database . They were all generated using the Affymetrix Mouse 430.2 platform. They are data from three tissues - brain, liver and craniofacial tissue and expression data for the so called "house keeping" genes  across over 6000 samples collected over a wide range of conditions. These data sets were chosen in part because they each had a relatively large number of samples generated by the same laboratory. Also, in light of the points mentioned the samples in two of the datasets (from the craniofacial and liver) were partially matched for gender, age, mouse strain tissue source and environmental conditions prior to sample collection. The data set from the brain involved three different strains of mice. The microarray data were preprocessed and normalized in the same manner. It is therefore expected that the primary sources of variability in gene expression would be a convolution of intrinsic, extrinsic, pathway-specific or global noise and noise associated with microarray sample preparation and hybridization to the DNA chip for the two data sets that each used the same strain of mice for all the samples. The brain data set would be expected to show additional genetic variation due to the utilization of three different strains in the generation of the samples.
Data sets used in the analysis
The microarray samples used in the analyses in this manuscript were based in part on six separate data sets (see Table 1 and Additional files 1, 2, 3, 4, Table S8, Table S3, Table S1 and Table S2 for better detail of the data in Table 1) obtained from the GEO database . Four of them were from the Affymetrix Mouse Genome 430.2 array that uses a single color, in situ oligonucleotide array technology while the other two were from a non-commercial platform using a two-color, spotted oligonucleotide array technology.
The GEO series codes for three of the Affymetrix data are GSE7759, GSE8396 and GSE9444-termed respectively as "Craniofacial", "Liver" and "Brain" based on the tissues from which the samples were drawn. The specific samples identified by their GSM codes that were used from the three data sets are provided in Additional file 3, Table S1. There were 105 samples present in the "Craniofacial" data set, 93 samples in the "Liver" data set and 69 samples in the "Brain" data set.
The data in the "Craniofacial" data set was analyzed in  where craniofacial development in mice was studied. C57BL/6J mice were sampled at 12 hour intervals from E10.5-E12.5. It is during this time that the facial structures in the mice form. Tissues from three distinct regions - frontonasal, maxillary and the mandibular prominence at each of these time points and analyzed for transcriptomic changes. Seven biological replicate samples were collected from each of these regions at every 12 hour time interval.
The data in the "Liver" data set was analyzed in  where the effects of various synthetic dietary triglycerides on the hepatic gene expression were studied. One of the objectives of this study was to understand the role of PPARα in the observed effects. Male mice of SV129 strain and PPARα-/- mice (2-6 months of age) on SV129 background were used. Apart from a control, the mice were treated with two synthetic agonists and four synthetic triglycerides. There were four or five biological replicate samples for each treatment and wild type or knockout mice combination.
The data in the "Brain" data set was analyzed in . The objective of this study was to analyze the transcriptomic changes in the brain and the liver resulting from varying amounts of sleep loss in 3 different strains (AKR/J, DBA/2J and C57BL/6J) of mice. In this manuscript only a subset of the samples from the brain were used. The mice of each strain were deprived of sleep for various time periods. There were three biological replicate samples per combination of strain and time period of sleep deprivation.
In addition, 6219 microarray samples on the Affymetrix 430.2 platform were also downloaded from GEO . The GSM sample ids along with a description of the experimental conditions under which the samples were collected are given in Additional file 4, Table S2. The samples were chosen using the GPLBrowse program  in April 2009. This program plots various moments of different samples on the GEO database versus each other. The scatter plot of the logarithm of mean of expression of all the samples versus the logarithm of the standard deviation of the samples on the GEO database from the Affymetrix Mouse 430.2 was visually examined. Samples that did not seem like outliers were chosen to form the set of samples that formed the basis of our analysis. Note that the raw data was downloaded in all cases and then preprocessed in a well-defined manner (see below). These samples were collected from mice of different strains, development stages, tissues, sex, and different laboratories over the past few years. The distribution of expression of the 21 house keeping genes identified in  was analyzed using this data.
The data from the non-commercial spotted oligonucleotide array (GEO accession: GSE2814) were analyzed in a number of papers [49–54] where one of the goals was to sex-specific and tissue-specific differences in gene expression. The data were from the liver of ApoE null (C57BL/6J × C3H/HeJ)F2 intercross mice. There were 155 male samples and 156 female samples - the data sets is hereafter called "Male" and "Female" respectively.
Data preprocessing and normalization
For each of the four Affymetrix data sets used, raw CEL format data from GEO was normalized using the R Bioconductor  implementation of the GCRMA normalization procedure . The Affymetrix Mouse 430.2 platform has data for 45101 probe sets in all. Each of the probe sets potentially map to a gene in the mouse genome. The samples in each of the "Craniofacial", "Liver" and "Brain" data sets were normalized separately. As described in the previous section, only a small subset of samples in each of three datasets was replicates, the remaining was obtained under different conditions (e.g. different development stage in a different facial region for the "Craniofacial" data). The altered conditions would have effects on the expressions of some of the genes in each of the data sets. In this manuscript, for each of the three datasets ("Craniofacial", "Liver" and "Brain") an attempt is made to identify those probe sets or genes whose expressions were not altered under all the conditions involved. This is done using the non-parametric Kruskal-Wallis one way analysis of variance test  for the expressions of each gene across all samples in a given data set. Probe sets with a p-value greater than a liberal 0.1 cutoff from the Kruskal Wallis test were deemed to be unaffected by any of the conditions involved in the generation of a given data set. Results of varying this 0.1 cutoff sensitivity to the results of distributional tests are provided in the Additional material. The Kruskal-Wallis test also assumes that the distributional form of the data being compared across the conditions remains the same. An alternate method that does not need the above assumption is the analysis using bootstrapping of residuals  from a heteroscedastic one-way anova model . However, this assumption results in over 99% of all probe sets (data not shown) in each of the three data sets being deemed unchanged over conditions involved in generation of the probe sets.
For the data set of 6219 microarray samples RAM memory limitations prohibit normalizing all samples together. To normalize these, the following steps were followed:
The samples were partitioned into sets of 75.
The gene expression data for each of these 75 samples were obtained after running the GCRMA routine.
Using the data from step 2, the gene expression across the whole data set of 6219 samples were normalized with respect to each other using the quantile normalization method as described in(Bolstad B: Probe level quantile normalization of high density oligonucleotide array data. Unpublished manuscript 2001.).
Yang et al  that use the "Male" and "Female" data define "active expressed" genes and a procedure to obtain them. In summary, genes differential expressed with respect to the reference pool are identified by a p-value cutoff of 0.01 from an error model used in their analysis. In addition to this set of transcriptionally active genes, genes that were significantly correlated (Pearson correlation p-value < 10-5) with these active genes were also chosen. This combined set of genes is what is used in the analysis in this paper. In the interest of having a relatively large number of genes for analysis, further criteria used by Yang et al  like presence of an eQTL with LOD score cutoff or significant correlations with the characterized traits of adiposity, plasma lipids, or atherosclerosis were ignored. The log ratio data were used then used as-is as downloaded from GEO.
Testing Distributional Assumptions
For each of the data sets "Craniofacial", "Liver" and "Brain", "Housekeeping", "Male" and "Female", Kolmogorov-Smirnov (KS) and Anderson-Darling (AD) hypothesis tests were used to test distributional assumptions. Both test the null hypothesis that a set of data comes from a given distribution, with distributional parameters possibly unknown. The AD test is more sensitive to differences in the tails of the data than the KS distribution. In testing for Normality, the AD test is known to be more powerful than the KS test . Normal, log-normal, logistic, log-logistic, Weibull, and extreme value distributions were tested. The "Male" and "Female" data were just tested on the normal, logistic and extreme value distributions because the log-ratio data from these data sets could be negative. The distributions were chosen because both tables of critical values for the tests with unknown distribution parameters and efficient implementations of parameter are available. In this manuscript the results for the AD are provided while those for the KS tests are provided in the Additional material.
For each of the data sets and each distribution F, each of identified (either by the Kruskal Wallis test for the "Craniofacial", "Liver" and "Brain" data sets or as actively expressed for the "Male" and "Female" data sets, as explained in the previous section) probe sets were tested in the following manner:
Use maximum likelihood estimation (MLE) to estimate for F
Use the KS and AD tests at the 90% and 95% to test whether the probe data comes from .
For each of these distributions, MATLAB version R2009B Statistics Toolbox MLE functions were used. Critical values can be found in .
In addition, for housekeeping genes, KS and AD tests were used to test for gamma and Pareto distributions across a set of 6219 samples. For these distributions, tables of critical values do not exist, but a method for generating p-values for KS tests with unknown distribution parameters from  was used. For the gamma distribution, MATLAB code Fastfit  was used for parameter estimation. For the Pareto distribution, known closed-form solutions for ML estimators are used.
L-moment ratio diagram
Mixture distribution fit
The Kruskal Wallis test was used on the logarithm of expression of the probe sets in each of the "Craniofacial", "Liver" and "Brain" data sets in order to identify those that were most likely unaffected by any of the conditions involved in the generation of these data sets. The results in Table 2 indicate that more than 50% (~25000/45000) of the probe sets were unaffected in the "Brain" data set while a far fewer were unaffected for the "Craniofacial" and the "Liver" data sets. Further the results in Table 2 indicate that of these unaffected probe sets for each of the three data sets, more than 50% rejected the null hypotheses (at a 95% confidence level) for all the distribution tests considered - Normal, Weibull, Extreme value, Logistic, Log-normal and Log-logistic. The same can be said about the house keeping genes . The data for these are based on 6219 microarray samples. Also, it seems that at least one of the above six distributions failed to be rejected for between 25 and 52% of the probe sets in each of the three data sets. Additional simulation-based Anderson-Darling (AD) tests were performed for the Gamma and Pareto distributions for the house keeping genes. 23/23 housekeeping probe sets rejected the AD null hypothesis for the Gamma distribution at the 95% confidence level while only 2/23 rejected the hypotheses for the Pareto distribution. However, the Kolmogorov-Smirnov (KS) test rejected 23/23 null hypotheses for both the Gamma and Pareto distributions. Additional results obtained by varying the Kruskal Wallis test p-value cutoff and the results obtained from the Kolmogorov-Smirnov goodness of fits tests are given in Additional files 5 and 6, Tables S6 and S4 respectively. Detailed probe level results are provided in Additional file 7, Table S5.
The results of the goodness of fit Anderson-Darling distribution tests for the "Male" and "Female" data sets showed different characteristics from those of the other data sets. Only around 43-46% of the probe sets rejected the normal hypothesis (as compared with 72-82% for the previous data sets). The logistic distribution was rejected less often than the previous data sets (21-24% as compared with 69-79%). The fit of the extreme value distributions were equally bad for both sets of data. This difference in characteristics of the goodness of fit test results between two different microarray platforms indicates the contribution of technology and/or of the normalization methodology to the distribution characteristics of microarray data.
The dependence of the mean on the higher order product moments are shown in Figures 1 and 2 - 1(a)-1(c), 2(a)-2(b): (mean versus coefficient of variation), 1(d)-1(f), 2(c)-2(d): (mean versus skewness) and 1(g)-1(i), 2(e)-2(f): (mean versus kurtosis). Note in the case of the "Male" and "Female" data sets the higher order moments are plotted against the absolute value of the mean of the log-ratio data. The overall trends of the higher moments with the means are consistent. Note that in the range of lower mean expression the coefficient of variation, skewness and kurtosis are unusually high. This probably reflects genes whose expressions are below the detection limit of the microarray and can be treated as optical noise  for the case of the Affymetrix data. For the "Male" and "Female" data this may reflect the instability of the ratio for low expression genes. By visual inspection of Figure 1, a value of 6 is chosen as cutoff of mean of the logarithm of gene expression so that mean of the logarithm of the measurements of the gene expression below 6 are considered noise and a cutoff of 0.05 absolute value of ratio was chosen for Figure 2. Loess curves (represented by green lines in the subplots) are fit through the noisy measurements and polynomials (represented by red lines in the subplots) are fit through the other data to describe the trend between mean and the next three higher product moments. A linear fit to the trend between the logarithms of the coefficient of variation (CV) and the mean of the logarithms of mean of expression of different genes are shown in Figure 1. The parameters of the linear fit are given in Table 3. The decreasing trend of the coefficient of variation with the mean of gene expression is valid across the five data sets. This recapitulates what is already known about microarray data (see ). The suggestion of consistency between the trends is also clear from observation of the coefficients in Table 3 (separately for the Affymetrix data sets and the Rosetta/Merck ("Male", "Female") data sets). This consistency is however not statistically valid. Also, note that the confidence intervals generated for the parameters in Table 3 are probably not entirely valid given issues of heteroscedasticity and non-normality of the residuals associated with the three fits (see Additional files 8, 9, 10, Figures S1-S3). Skewness and kurtosis did not have a significant trend with the mean of distribution the gene expression as measured by the R2 values for different polynomial fits (the R2 values did not exceed 0.1). So the mean values of skewness and kurtosis are plotted by the red lines in Figure 1d-1i, 2c-2f. These mean values are given in Table 4. The Normal distribution has a skewness value of 0 and a kurtosis of 3. The values in Table 4 suggest a majority of the distributions of gene expression as being more negatively (positively) skewed than the Normal and as having a higher (higher) kurtosis value for the Affymetrix (Rosetta/Merck) data.
The relationship between the kurtosis and the skewness of the distribution of gene expressions is studied next. The variations of kurtosis of the distribution of gene expression with skewness for the three data sets are plotted in Figures 3 and 4. Note this plot is restricted to only those genes with mean expression greater than 6 for Figure 3 and absolute value log ratio greater than 0.05 for Figure 4. The quadratic fits to the trends are also shown in Figures 3 and 4. The coefficients of the fit are given in Table 5. The confidence intervals for these coefficients may again be questioned given the issues with the heteroscedasticity and non-normality of the residuals (see Additional files 11, 12, 13, Figures S4-S6). Alternate non-linear models were evaluated for their ability to alleviate these issues. However a solution wasn't found that would make reduce both concerns simultaneously. So in the interest of simplicity and also given the fact the stated confidence intervals are not used to draw further conclusions, the results are presented without making any further attempts to satisfy the assumptions of homoscedasticity and normality of residuals. Theoretically, the kurtosis (denoted by K) and the skewness (denoted by S) have to satisfy the following inequality (see ),
The data in Table 5 are again suggestive (though not statistically valid) of a consistent quadratic trend across the three data sets between kurtosis and skewness of the distributions of gene expression.
Further validation of the lack of fit of gene expression to any of the standard theoretical univariate probability distributions can be seen in the L-moment ratio diagram in Figure 5. The theory of L-moments was introduced in  and they have been shown to provide unbiased estimates of the higher order moments of a probability distribution. The standard product-moments have been shown to be significantly affected by the sample size and the presence of outliers . The relationship between the third-order and fourth-order L-moments (representing the skewness and kurtosis of the distribution) can be plotted on an diagram called the L-moment ratio diagram. This is shown in Figure 5. The L-skewness and L-kurtosis for all the unaffected probe sets (with mean expression greater than 6) in each of four Affymetrix data sets are plotted in Figure 6a-6d and for the actively expressed genes (with absolute value of log ratio greater than 0.05) for the Rosetta/Merck data sets in Figure 7a-7b. The l-moment plots with all the data including those with mean expression less than 6 for the Affymetrix data is provided in Additional file 14, Figure S7 and those for all Rosetta/Merck data including those with absolute value of mean of log ratio of gene expression greater than 0.05 is provided in Additional file 15, Figure S8. The probe sets in each of these subplots represented as circles are also colored. The intensity of redness of probe set circle reflects its mean expression. This coloring scheme points to the lack of dependence of the L-(skewness and kurtosis) moments on the mean expression. The consistent non-linear relationships across the three data sets between the product moments skewness and kurtosis identified in Figures 3 and 4 also appear to hold for the l-skewness and l-kurtosis.
The samples from the "Male" and "Female" datasets could be considered more or less homogenous with respect to sex, tissue, diet and experimental conditions. One reason we could be seeing poor fits to standard distributions could be that there are different modes to the distribution of the expression of a given gene reflecting the genetic variation in the F2 cross animals or a stochastic network influence (e.g. gene involved in a given biochemical pathway) of its expression . Hence it is possible for a mixture of distributions to better fit the observed data. The results of chi-square fits to a three normal mixture distribution are given in Table 6. There is no improvement in fit to the mixture distribution on comparison of the results with the chi-square fit to a single normal.
In the past several years, there has been an explosion in amount of quantitative biological data either in terms of transcriptomics, sequencing data, genetic structure variation, proteomics or metabolomics. DNA microarrays have been important and valuable resource for understanding perturbations to biological systems in terms of identifying affected gene expressions. The standard statistical methods are being either directly used or modified to work with gene expression data. Unfortunately, only a small number of replicate samples per treatment (2-10) are used for analysis owing to the cost of the experimentsal system. This point plus the fact that the probability distributions of gene expressions as measured by these arrays were not characterized leads one of logically question the use various statistical methods that are based on distributional assumptions. Heuristics are being proposed that attempt to relax the reliance on this distributional assumption. One example of this would be the method of jointly using the p-value from a two sample t-test along with the gene expression fold-change to identify differentially expressed genes. Alternately, there is an increased use of non-parametric methods or permutation-based methods [69, 38].
The essential question that we address in this manuscript is whether the distribution of the logarithm of gene expression as measured by DNA microarrays can be approximated by any of the standard theoretical univariate probability distributions. The results in Table 2 and Figures 6 and 7 suggest that it is unlikely that there is a known probability distribution that all gene expressions would follow. Now if this is the case then an alternative would be is to see if there are consistent relationships between the various moments of the distributions of the gene expressions. The analyses in Figure 1, 2, 3, 4 suggest that there are consistent (though not statistically valid) relationships between the mean and coefficient of variation (Additional file 16, Table S7 lists the spearman rank correlation between the mean and standard deviation of the gene expressions corresponding to each of the data sets analyzed) and the skewness and kurtosis of the distribution of gene expression respectively. Comparison of results as obtained from a commercial Affymetrix platform and a non-comercial Rosetta/Merck platform also indicated the influences of the microarray technology on the distribution characteristics of the data.
The observed distributional characteristics of gene expression data in this manuscript suggest either the need for the use of non-parametric statistical methods or a need to develop newer statistical/mathematical approaches that are capable of and are optimal for working with these kinds of distributions.
Because of the nature of the data used in this paper, we are unable to separate the contribution to the variation of the data due to biological reasons from those induced by the microarray technology. The noise we observe is probably the result of the convolution of these two factors. In light of the increasing use of newer technologies like those based on Next Generation Sequencing  and despite the real possibility of the reduced future use of DNA microarrays an analysis like that presented in this paper would be useful in guiding analyses of these new data and also in making distributional hypotheses.
The analyses of the empirical probability distribution of gene expressions from five publicly available data sources with relatively large number of samples have been described in this manuscript. The failure of the distributions to follow any of the known theoretical univariate probability distributions has been demonstrated though the data suggests consistent relationships between the different moments of the distributions. These moment relationships should motivate the development of Bayesian methods with appropriately chosen priors.
Bray NJ, Buckland PR, Owen MJ, O'Donovan MC: Cis-acting variation in the expression of a high proportion of genes in human brain. Hum Genet 2003, 113: 149–153.
Buckland PR: Allele-specific gene expression differences in humans. Hum Mol Genet 2004, 13(Spec No 2):R255–260. 10.1093/hmg/ddh227
He H, Olesnanik K, Nagy R, Liyanarachchi S, Prasad ML, Stratakis CA, Kloos RT, de la Chapelle A: Allelic variation in gene expression in thyroid tissue. Thyroid 2005, 15: 660–667. 10.1089/thy.2005.15.660
McRae AF, Matigian NA, Vadlamudi L, Mulley JC, Mowry B, Martin NG, Berkovic SF, Hayward NK, Visscher PM: Replicated effects of sex and genotype on gene expression in human lymphoblastoid cell lines. Hum Mol Genet 2007, 16: 364–373. 10.1093/hmg/ddl456
Monks SA, Leonardson A, Zhu H, Cundiff P, Pietrusiak P, Edwards S, Phillips JW, Sachs A, Schadt EE: Genetic inheritance of gene expression in human cell lines. Am J Hum Genet 2004, 75: 1094–1105. 10.1086/426461
Sharma A, Sharma VK, Horn-Saban S, Lancet D, Ramachandran S, Brahmachari SK: Assessing natural variations in gene expression in humans by comparing with monozygotic twins using microarrays. Physiological Genomics 2005, 21: 117–123. 10.1152/physiolgenomics.00228.2003
Schadt EE, Molony C, Chudin E, Hao K, Yang X, Lum PY, Kasarskis A, Zhang B, Wang S, Suver C, Zhu J, Millstein J, Sieberts S, Lamb J, GuhaThakurta D, Derry J, Storey JD, Avila-Campillo I, Kruger MJ, Johnson JM, Rohl CA, van Nas A, Mehrabian M, Drake TA, Lusis AJ, Smith RC, Guengerich FP, Strom SC, Schuetz E, Rushmore TH, Ulrich R: Mapping the genetic architecture of gene expression in human liver. PLoS Biol 2008, 6: e107. 10.1371/journal.pbio.0060107
Pritchard C, Coil D, Hawley S, Hsu L, Nelson PS: The contributions of normal variation and genetic background to mammalian gene expression. Genome Biol 2006, 7: R26. 10.1186/gb-2006-7-3-r26
Kang PB, Kho AT, Sanoudou D, Haslett JN, Dow CP, Han M, Blasko JM, Lidov HGW, Beggs AH, Kunkel LM: Variations in gene expression among different types of human skeletal muscle. Muscle & Nerve 2005, 32: 483–491.
Khymenets O, Covas MI, Farre M, Langohr K, Fito M, de la Torre R: Role of sex and time of blood sampling in SOD1 and SOD2 expression variability. Clin Biochem 2008, 41: 1348–1354. 10.1016/j.clinbiochem.2008.08.064
Eady JJ, Wortley GM, Wormstone YM, Hughes JC, Astley SB, Foxall RJ, Doleman JF, Elliott RM: Variation in gene expression profiles of peripheral blood mononuclear cells from healthy volunteers. Physiol Genomics 2005, 22: 402–411. 10.1152/physiolgenomics.00080.2005
Chowers I, Liu D, Farkas RH, Gunatilaka TL, Hackam AS, Bernstein SL, Campochiaro PA, Parmigiani G, Zack DJ: Gene expression variation in the adult human retina. Hum Mol Genet 2003, 12: 2881–2893. 10.1093/hmg/ddg326
Whitney AR, Diehn M, Popper SJ, Alizadeh AA, Boldrick JC, Relman DA, Brown PO: Individuality and variation in gene expression patterns in human blood. Proceedings of the National Academy of Sciences of the United States of America 2003, 100: 1896. 10.1073/pnas.252784499
Boedigheimer MJ, Wolfinger RD, Bass MB, Bushel PR, Chou JW, Cooper M, Corton JC, Fostel J, Hester S, Lee JS: Sources of variation in baseline gene expression levels from toxicogenomics study control animals across multiple laboratories. BMC genomics 2008, 9: 285. 10.1186/1471-2164-9-285
Pritchard CC, Hsu L, Delrow J, Nelson PS: Project normal: defining normal variance in mouse gene expression. Proceedings of the National Academy of Sciences 2001, 98: 13266. 10.1073/pnas.221465998
Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G: A gene atlas of the mouse and human protein-encoding transcriptomes. Proceedings of the National Academy of Sciences of the United States of America 2004, 101: 6062. 10.1073/pnas.0400782101
Sood R, Zehnder JL, Druzin ML, Brown PO: Gene expression patterns in human placenta. Proceedings of the National Academy of Sciences 2006, 103: 5478. 10.1073/pnas.0508035103
Maret S, Dorsaz S, Gurcel L, Pradervand S, Petit B, Pfister C, Hagenbuchle O, O'Hara B, Franken P, Tafti M: Homer1a is a core brain molecular correlate of sleep loss. Sleep 2008, 31: A360-a360.
Leach SM, Tipney H, Feng WG, Baumgartner WA, Kasliwal P, Schuyler RP, Williams T, Spritz RA, Hunter L: Biomedical Discovery Acceleration, with Applications to Craniofacial Development. Plos Computational Biology 2009., 5: 10.1371/journal.pcbi.1000215
Sanderson LM, de Groot PJ, Hooiveld GJ, Koppen A, Kalkhoven E, Muller M, Kersten S: Effect of synthetic dietary triglycerides: a novel research paradigm for nutrigenomics. PLoS One 2008, 3: e1681. 10.1371/journal.pone.0001681
Thomas RS, O'Connell TM, Pluta L, Wolfinger RD, Yang L, Page TJ: A comparison of transcriptomic and metabonomic technologies for identifying biomarkers predictive of two-year rodent cancer bioassays. Toxicol Sci 2007, 96: 40–46. 10.1093/toxsci/kfl171
Kawai T, Morita K, Masuda K, Nishida K, Shikishima M, Ohta M, Saito T, Rokutan K: Gene expression signature in peripheral blood cells from medical students exposed to chronic psychological stress. Biol Psychol 2007, 76: 147–155. 10.1016/j.biopsycho.2007.07.008
Maheshri N, O'Shea EK: Living with noisy genes: how cells function reliably with inherent variability in gene expression. Annu Rev Biophys Biomol Struct 2007, 36: 413–434. 10.1146/annurev.biophys.36.040306.132705
Raser JM, O'Shea EK: Noise in gene expression: origins, consequences, and control. Science 2005, 309: 2010–2013. 10.1126/science.1105891
Tu Y, Stolovitzky G, Klein U: Quantitative noise analysis for gene expression microarray experiments. Proceedings of the National Academy of Sciences 2002, 99: 14031. 10.1073/pnas.222164199
Dai H, Meyer M, Stepaniants S, Ziman M, Stoughton R: Use of hybridization kinetics for differentiating specific from non-specific binding to oligonucleotide microarrays. Nucleic Acids Res 2002, 30: e86. 10.1093/nar/gnf085
Thomas RS, Bao W, Chu TM, Bessarabova M, Nikolskaya T, Nikolsky Y, Andersen ME, Wolfinger RD: Use of Short-term Transcriptional Profiles to Assess the Long-term Cancer-Related Safety of Environmental and Industrial Chemicals. Toxicological Sciences 2009.
Ryden P, Andersson H, Landfors M, Naslund L, Hartmanova B, Noppa L, Sjostedt A: Evaluation of microarray data normalization procedures using spike-in experiments. Bmc Bioinformatics 2006., 7: 10.1186/1471-2105-7-300
Naef F, Socci ND, Magnasco M: A study of accuracy and precision in oligonucleotide arrays: extracting more signal at large concentrations. Bioinformatics 2003, 19: 178. 10.1093/bioinformatics/19.2.178
Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Edgar R: NCBI GEO: mining tens of millions of expression profiles--database and tools update. Nucleic Acids Res 2007, 35: D760–765. 10.1093/nar/gkl887
Kerr MK, Martin M, Churchill GA: Analysis of variance for gene expression microarray data. Journal of Computational Biology 2000, 7: 819–837. 10.1089/10665270050514954
Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW: On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology 2001, 8: 37–52. 10.1089/106652701300099074
Rocke DM, Durbin B: A model for measurement error for gene expression arrays. Journal of Computational Biology 2001, 8: 557–569. 10.1089/106652701753307485
Baldi P, Long AD: A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 2001, 17: 509. 10.1093/bioinformatics/17.6.509
Jain N, Thatte J, Braciale T, Ley K, O'Connell M, Lee JK: Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays. Bioinformatics 2003, 19: 1945. 10.1093/bioinformatics/btg264
Smyth GK: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical applications in genetics and molecular biology 2004, 3: 1027. 10.2202/1544-6115.1027
Hatfield GW, Hung S, Baldi P: Differential analysis of DNA microarray gene expression data. Molecular microbiology 2003, 47: 871–877. 10.1046/j.1365-2958.2003.03298.x
Troyanskaya OG, Garber ME, Brown PO, Botstein D, Altman RB: Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics 2002, 18: 1454. 10.1093/bioinformatics/18.11.1454
Tusher T, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. PNAS 2001, 98: 5116–5121. 10.1073/pnas.091062498
Wu B: Differential gene expression detection using penalized linear regression models: the improved SAM statistics. Bioinformatics 2005, 21: 1565. 10.1093/bioinformatics/bti217
Yamanaka T, Toyoshiba H, Sone H, Parham FM, Portier CJ: The TAO-Gen algorithm for identifying gene interaction networks with application to SOS repair in E. coli. Environmental health perspectives 2004, 112: 1614. 10.1289/ehp.7105
Yu J, Smith VA, Wang PP, Hartemink AJ, Jarvis ED: Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics-Oxford 2004, 20: 3594–3603. 10.1093/bioinformatics/bth448
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. science 1999, 286: 531. 10.1126/science.286.5439.531
van de Vijver MJ, He YD, van't Veer LJ, Dai H, Hart AAM, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ: A gene-expression signature as a predictor of survival in breast cancer. The New England journal of medicine 2002, 347: 1999. 10.1056/NEJMoa021967
Thomas RS, O'Connell TM, Pluta L, Wolfinger RD, Yang L, Page TJ: A comparison of transcriptomic and metabonomic technologies for identifying biomarkers predictive of two-year rodent cancer bioassays. Toxicological Sciences 2007, 96: 40. 10.1093/toxsci/kfl171
Parkinson H, Sarkans U, Shojatalab M, Abeygunawardena N, Contrino S, Coulson R, Farne A, Lara GG, Holloway E, Kapushesky M, Lilja P, Mukherjee G, Oezcimen A, Rayner T, Rocca-Serra P, Sharma A, Sansone S, Brazma A: ArrayExpress - a public repository for microarray gene expression data at the EBI. Nucleic Acids Research 2005, 33: D553-D555. 10.1093/nar/gki056
She X, Rohl CA, Castle JC, Kulkarni AV, Johnson JM, Chen R: Definition, conservation and epigenetics of housekeeping and tissue-enriched genes. BMC Genomics 2009, 10: 269. 10.1186/1471-2164-10-269
Robbins KA, Burkhardt C: GPLBrowse: An Interactive Platform Browser for NCBI GEO. CS-TR-2007–008: Department of Computer Science, University of Texas at San Antonio; 2007.
Chen Y, Zhu J, Lum PY, Yang X, Pinto S, MacNeil DJ, Zhang C, Lamb J, Edwards S, Sieberts SK: Variations in DNA elucidate molecular networks that cause disease. Nature 2008, 452: 429–435. 10.1038/nature06757
Ghazalpour A, Doss S, Zhang B, Wang S, Plaisier C, Castellanos R, Brozell A, Schadt EE, Drake TA, Lusis AJ: Integrating genetic and network analysis to characterize genes related to mouse weight. PLoS Genet 2006, 2: e130. 10.1371/journal.pgen.0020130
van Nas A, Ingram-Drake L, Sinsheimer JS, Wang SS, Schadt EE, Drake T, Lusis AJ: Expression Quantitative Trait Loci: Replication, Tissue-and Sex-Specificity in Mice. Genetics
Wang S, Yehya N, Schadt EE, Wang H, Drake TA, Lusis AJ: Genetic and genomic analysis of a fat mass trait with complex inheritance reveals marked sex specificity. PLoS Genet 2006, 2: e15. 10.1371/journal.pgen.0020015
Wu S, Mar-Heyming R, Dugum EZ, Kolaitis NA, Qi H, Pajukanta P, Castellani LW, Lusis AJ, Drake TA: Upstream transcription factor 1 influences plasma lipid and metabolic traits in mice. Human molecular genetics 2009.
Yang X, Schadt EE, Wang S, Wang H, Arnold AP, Ingram-Drake L, Drake TA, Lusis AJ: Tissue-specific expression and regulation of sexually dimorphic genes in mice. Genome Research 2006, 16: 995. 10.1101/gr.5217506
Gentleman R, Carey V, Bates D, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J: Bioconductor: open software development for computational biology and bioinformatics. Genome biology 2004, 5: R80. 10.1186/gb-2004-5-10-r80
Wu Z, Irizarry RA, Gentleman R, Martinez-Murillo F, Spencer F: A model-based background adjustment for oligonucleotide expression arrays. Journal of the American Statistical Association 2004, 99: 909–917. 10.1198/016214504000000683
Lehmann EL, D'Abrera HJM: Nonparametrics: statistical methods based on ranks. Prentice Hall Upper Saddle River, New Jersey; 1998.
Efron B: Nonparametric estimates of standard error: The jackknife, the bootstrap and other methods. Biometrika 1981, 68: 589. 10.1093/biomet/68.3.589
Akritas MG, Papadatos N: Heteroscedastic one-way ANOVA and lack-of-fit tests. Journal of the American Statistical Association 2004, 99: 368–382. 10.1198/016214504000000412
Law AM, Kelton WD: Simulation modelling and analysis. New York et al 1991.
Ross SM: Simulation-Statistical Modeling and Decision Science. Academic Press; 1997.
The Fastfit Matlab toolbox[http://research.microsoft.com/en-us/um/people/minka/software/fastfit/]
Hosking JRM: L-moments: analysis and estimation of distributions using linear combinations of order statistics. Journal of the Royal Statistical Society Series B (Methodological) 1990, 105–124.
Vogel RM, Fennessey NM: L moment diagrams should replace product moment diagrams. Water Resources Research 1993, 29: 1745–1752. 10.1029/93WR00341
Asquith WH: lmomco: L-moments, Trimmed L-moments, L-comoments, and Many Distributions. 2009. R package version 0.97.4
Team RDC: R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2006.
Macdonald PDM, Green PEJ: User's Guide to Program MIX: An Interactive Program for Fitting Mixtures of Distributions. 1988.
Kendall MG, Stuart A: The advanced theory of statistics. Vol. 1, Distribution theory. Griffin Bucks; 1969.
Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays. Google Patents; 2008.
Mardis ER: Next-generation DNA sequencing methods. 2008.
This research was supported [in part] by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences.
RT and SM designed the study. LdT carried out the all goodness of fits analyses and contributed to the write-up. XC contributed in checking the annotations of the entire microarray data samples used and also in the interpretation of the results. RT performed the l-moment analysis and drafted the manuscript. SM also contributed to the drafting of the manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
Additional file 1: : Detailed description of the data in Table 1. This includes more description of the microarray platform and the base pair length of the probes used on each of the microarray platforms. (XLSX 12 KB)
Additional file 3: : GEO  microarray samples for the "Craniofacial", "Liver" and "Brain" data sets. (XLS 50 KB)
Additional file 4: : GEO  microarray samples used to analyze the expression of probe sets corresponding to the house keeping genes (XLS 5 MB)
Additional file 5: : Results of goodness of fit tests for the Anderson-Darling tests for the six analyzed probability distributions obtained by the varying the cutoff for the Kruskal-Wallis test. (XLS 31 KB)
Additional file 6: : Best fit distribution Kolmogorov-Smirnov test results for "Craniofacial", "Liver", "Brain", "Male" and "Female" data sets (XLS 14 KB)
Additional file 7: : Probe set level best fit distribution results for the Anderson Darling (AD) and the Kolmogorov-Smirnov test (KS) tests at 90 and 95 percent confidence levels (XLS 18 MB)
Additional file 8: : Diagnostic plot for the linear model between the logarithm of the coefficient of variation (CV) and the mean of the distribution of the logarithm of gene expression for the "Craniofacial" data set. Note the residual plots (subplots (b) and (c)) also provide the pearson correlation (denoted by ρ) between the absolute value of the residuals and the mean and logarithm of the CV respectively. (TIFF 228 KB)
Additional file 9: : Diagnostic plot for the linear model between the logarithm of the coefficient of variation (CV) and the mean of the distribution of the logarithm of gene expression for the "Liver" data set. Note the residual plots (subplots (b) and (c)) also provide the pearson correlation (denoted by ρ) between the absolute value of the residuals and the mean and logarithm of the CV respectively. (TIFF 173 KB)
Additional file 10: : Diagnostic plot for the linear model between the logarithm of the coefficient of variation (CV) and the mean of the distribution of the logarithm of gene expression for the "Brain" data set. Note the residual plots (subplots (b) and (c)) also provide the pearson correlation (denoted by ρ) between the absolute value of the residuals and the mean and logarithm of the CV respectively. (TIFF 162 KB)
Additional file 11: : Diagnostic plot for the quadratic model between the kurtosis and the skewness of the distribution of the logarithm of gene expression for the "Craniofacial" data set. Note the residual plots (subplots (b) and (c)) also provide the pearson correlation (denoted by ρ) between the absolute value of the residuals and the skewness and kurtosis respectively. (TIFF 105 KB)
Additional file 12: : Diagnostic plot for the quadratic model between the kurtosis and the skewness of the distribution of the logarithm of gene expression for the "Liver" data set. Note the residual plots (subplots (b) and (c)) also provide the pearson correlation (denoted by ρ) between the absolute value of the residuals and the skewness and kurtosis respectively. (TIFF 107 KB)
Additional file 13: : Diagnostic plot for the quadratic model between the kurtosis and the skewness of the distribution of the logarithm of gene expression for the "Brain" data set. Note the residual plots (subplots (b) and (c)) also provide the pearson correlation (denoted by ρ) between the absolute value of the residuals and the skewness and kurtosis respectively. (TIFF 140 KB)
Additional file 14: : l-moment ratio diagram for the four data sets as in Figure 6 but including all probe sets including those with mean expression less than 6. (EPS 2 MB)
Additional file 15: : l-moment ratio diagram for the four data sets as in Figure 7 but including all probe sets including those with absolute value of mean of log expression ratio less than 0.05. (EPS 2 MB)
Additional file 16: : Spearman rank correlation between the mean and standard deviation of the measured data for "Craniofacial", "Liver", "Brain", "Male" and "Female" data sets. (XLS 14 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
About this article
Cite this article
Thomas, R., de la Torre, L., Chang, X. et al. Validation and characterization of DNA microarray gene expression data distribution and associated moments. BMC Bioinformatics 11, 576 (2010). https://doi.org/10.1186/1471-2105-11-576
- Gene Expression Data
- Gene Expression Omnibus
- Distributional Assumption
- Pareto Distribution
- Microarray Platform