Comprehensive analysis of correlation coefficients estimated from pooling heterogeneous microarray data
- Márcia M Almeida-de-Macedo^{1, 2}Email author,
- Nick Ransom^{1},
- Yaping Feng^{1},
- Jonathan Hurst^{1} and
- Eve Syrkin Wurtele^{1}
DOI: 10.1186/1471-2105-14-214
© Almeida-de-Macedo et al.; licensee BioMed Central Ltd. 2013
Received: 28 August 2012
Accepted: 21 June 2013
Published: 4 July 2013
Abstract
Background
The synthesis of information across microarray studies has been performed by combining statistical results of individual studies (as in a mosaic), or by combining data from multiple studies into a large pool to be analyzed as a single data set (as in a melting pot of data). Specific issues relating to data heterogeneity across microarray studies, such as differences within and between labs or differences among experimental conditions, could lead to equivocal results in a melting pot approach.
Results
We applied statistical theory to determine the specific effect of different means and heteroskedasticity across 19 groups of microarray data on the sign and magnitude of gene-to-gene Pearson correlation coefficients obtained from the pool of 19 groups. We quantified the biases of the pooled coefficients and compared them to the biases of correlations estimated by an effect-size model. Mean differences across the 19 groups were the main factor determining the magnitude and sign of the pooled coefficients, which showed largest values of bias as they approached ±1. Only heteroskedasticity across the pool of 19 groups resulted in less efficient estimations of correlations than did a classical meta-analysis approach of combining correlation coefficients. These results were corroborated by simulation studies involving either mean differences or heteroskedasticity across a pool of N > 2 groups.
Conclusions
The combination of statistical results is best suited for synthesizing the correlation between expression profiles of a gene pair across several microarray studies.
Background
There is a wealth of information enclosed in the massive amount of microarray data so far accumulated in public repositories. The variety of data sets generated from the assortment of experiments is a major obstacle in the path leading from these data to information. Specific issues relating to data heterogeneity across microarray studies include differences across platforms, differences within and between labs, and/or differences among experimental factors such as treatments and tissues [1, 2]. Furthermore, concerns regarding integration of studies from multiple sources in general, such as variations in design, research goals, or quality of implementation, add to these issues [3, 4].
Inappropriate integration of microarray data from public repositories could lead to equivocal results [5]. The “Simpson’s paradox” [6], which refers to contradictory statistical results obtained when analysis is performed within versus across groups of data [7], is an example of mishandling of data. Blyth [8] gives an example involving the analysis of 2x2 contingency tables across two groups, and Hassler and Thadewald [9] also illustrate Simpson’s paradox when correlation coefficients are estimated from a pool of two groups versus within each group. In both cases, the paradox can be explained as results are further investigated in light of the specific statistical properties of each group of data. The “ecological fallacy” happens when the correlation of aggregated variables results in a significant relationship that is due only to aggregation rather than to any real association [10] (p. 285). An early example of an ecological fallacy can be found in Gehlke and Biehl [11], whose study of grouping effects in census tract data showed that the magnitude of correlation coefficients of two variables tend to increase as the level of census tract aggregation increases. This problem was later referred to as the “modifiable areal unit problem” and further studied by Openshaw and Taylor [12].
Combining statistical results (e.g., parameter estimates, p-values) of independent studies that address similar questions has been a standard procedure in classic meta-analyses [4, 13]. This approach entails analyzing each data set independently and then combining the results, as in a mosaic. Meta-analysis of microarray data has been applied in a broader context, as some works include data spanning a wide range of purposes and designs. Parmigiani et al.[14], in a quest for a common gene signature across multiple cancer types, developed a statistical method to identify and assess the intersection of multiple gene expression signatures across 40 published cancer-related microarray studies. On the other hand, Wirapati et al.[15] and Rhodes et al.[16] developed specific meta-analysis methods to integrate gene expression signatures of breast and lung cancer, respectively, across independent studies of microarray data. Hu et al.[17] and Borozan et al.[18] proposed methods that extend traditional effect-size models to combine information from different microarray studies as a way to evaluate or unify lists of genes differentially expressed across them.
Another approach combines data from multiple microarray studies (termed “pooled data”) in a melting pot of data and analyzes them as a single data set. Kim and Webster [19] used public databases containing microarray data and biological traits on cytoarchitectural abnormalities from the same samples of patients belonging to three groups of major mental disorders plus a control group. Their study used gene expression data measured through two array types, the Affymetrix Human Genome U133 Set A and the Affymetrix Human Genome 95av2, and the authors carried out a correlation analysis between each gene expression and the biological traits of each subject; although not fully described in the paper, it seems the correlation analysis was performed on the pooled data set from independent studies. Subsequent gene ontology (GO) [20] enrichment analysis revealed significant overrepresentation of biological processes, such as cellular metabolism, central nervous system development, cell motility, and programmed cell death, in groups of genes that were significantly correlated with biological traits. Mentzen and Wurtele [21] and Horan et al.[22] have created co-expression networks for Arabidopsis thaliana based on parameters of co-expression similarity that were estimated from a large pool of microarray data downloaded from public repositories. Mentzen and Wurtele [21] pooled data from 963 Affymetrix gene chips, distributed across 71 independent studies encompassing diverse organs, conditions, and genotypes “to quest the transcriptome in response to a wide variety of environmentally, genetically, and developmentally induced perturbations”. Horan et al.[22] pooled data corresponding to 1310 Affymetrix microarrays divided among 41 independent studies. Both works used cluster analysis based on Pearson correlation coefficients as a measure of similarity of gene expression profiles from Arabidopsis. Mentzen and Wurtele [21] analyzed data from 21,000 gene probes on the gene chip and identified clusters of co-expressed genes as regulons. Horan et al.[22] used clusters to identify groups of co-regulated protein of unknown function and protein of known function encoding genes from Arabidopsis. In both works, GO enrichment analysis showed that networks based on gene-to-gene correlations estimated from pooling data from multiple microarray studies were not random. A similar approach has been used to obtain regulon information from a human transcriptomic network derived from almost 20,000 microarrays [23]. This analysis also showed a non random functional distribution of regulons.
From a statistical standpoint, combining data from independent microarray studies into a large pool as a single set can be acceptable if data homogeneity can be ensured across studies. Yet, this condition is nearly impossible to ensure considering that significant data heterogeneity is reported even for completely replicated microarray experiments carried out by the same lab [1]. Nevertheless, it can be argued that GO enrichment implies meaningful biology and significant GO enrichment has been shown for networks created from pooled data [19, 21-23]. Moreover, information gathered through a single data set analysis has led to gene function knowledge discovery [24].
The objective of this study was to perform a comprehensive analysis of Pearson correlation coefficients estimated from pooling heterogeneous groups of data (melting pot approach) in a large-scale gene expression analysis of publicly available Affymetrix microarrays and compare it to the analysis (of the same data) that combines statistical results of individual groups (mosaic approach). Our study included two specific objectives: (a) to determine the specific effect of different means and heteroskedasticity across the many groups comprising a pool of microarray data on the sign and magnitude of gene-to-gene Pearson correlation coefficients obtained from the pool of data, and (b) to quantify the extent of bias in gene-to-gene Pearson correlation coefficients obtained from a pool of heterogeneous groups of microarray data.
In the “Methods” section of this article, we describe the statistical theory that we applied to analyze the components of Pearson correlation coefficients obtained from a pool of heterogeneous microarray groups. The “Simulation study” section provides results of a study that further tests the specific effect on Pearson correlation coefficients of only mean differences, and only heteroskedasticity across N > 2 groups, and the validity of our methodology when groups have a small number of elements. In the section “Application to experimental microarray data” we illustrate the results predicted by both theory and simulation with data from 10 microarray experiments. At the end of this section, we provide an assessment of the bias of correlation coefficients estimated across a pool of heterogeneous groups of microarray data. We discuss our results and summarize our conclusions in the last section.
Methods
Dissecting components of the Pearson correlation coefficient obtained from a pool of microarray data
Hassler and Thadewald [9] developed the asymptotic formulation to quantify and explain differences between the Pearson correlation coefficient estimated from combining two heterogeneous groups into one pool and the Pearson correlation coefficients estimated within each group. They illustrated the problem with a set of measurements on height (cm) and weight (kg) reported by 184 first-year college students with a roughly even number of males and females. As the authors emphasized in their work, male and female groups were not homogeneous because “male students are taller and heavier than female students, and variation around the mean also differs between the groups”.
The generalization of Hassler and Thadewald’s [9] asymptotic analysis for N heterogeneous groups (refer to their original work for specifics about the asymptotic analysis) is provided in Equation 1. For the purpose of applying their theoretical work to analyze correlation coefficients obtained from a pool of microarray data, we consider N heterogeneous groups of gene expression data measured through microarrays. Each group of data can be described as a matrix M_{ i } of g genes by n_{ i } columns (each column of the matrix M_{ i } corresponds to the expression of g genes measured through one microarray). We assume that expression levels of any given gene pair xy within each group, i.e. x,y ∈ M_{ i }, are bivariate random normal variables that are identically distributed with means μ_{xy,i} = (μ_{x,i},μ_{y,i}) and variance-covariance matrix ${\mathrm{\Sigma}}_{\mathit{\text{xy}},i}=\left(\begin{array}{cc}{\sigma}_{x,i}^{2}& {\sigma}_{\mathit{\text{xy}},i}\\ {\sigma}_{\mathit{\text{xy}},i}& {\sigma}_{y,i}^{2}\end{array}\right)$, ∀ i = 1,N. Therefore, heterogeneities across N groups of microarray data are characterized by μ_{xy,i} ≠ μ_{xy,j} and/or Σ_{xy,i} ≠ Σ_{xy,i}, for i ≠ j.
Results
Simulation study
This section presents results of a study using simulated data that had the purpose of further investigating correlation coefficients obtained from a pool of N groups under the following specific conditions: (a) occurrence of only mean differences across N > 2 groups and (b) occurrence of only heteroskedasticity across N > 2 groups. Our study also evaluated the validity of using estimates of the asymptotic expression given in Equation 1 to explain the components of the pooled correlation coefficient when groups comprising the pool of data have a small number of elements n_{ i }.
- 1.
For each simulated group i, i = 1,N, generate n _{ i } data pairs from a multivariate normal distribution with parameters μ _{xy,i} and Σ_{xy,i} (we used the function mvrnorm in MASS [26]). Each simulated group is a matrix of 2 rows (genes) by n _{ i } columns (number of elements n _{ i }).
- 2.
Combine simulated data of N groups into one pool of 2 rows (genes) by $\sum _{i=1}^{N}{n}_{i}$ columns.
- 3.
Obtain the Pearson correlation coefficient from the pool of data (we used the function cor in R [25]).
- 4.
Repeat steps 1-3 above 1000 times; results are presented as averages over 1000 repetitions.
As a control, we first performed an experiment with parameters μ_{xy,i} = (0,0) and ${\mathrm{\Sigma}}_{\mathit{\text{xy}},i}=\left(\begin{array}{cc}1& 0\\ 0& 1\end{array}\right)$for all groups i = 1,10. This simulation provided nearly zero correlation coefficients (−0.004 ≤ r_{ xy } ≤ 0.003), thus reassuring us that our simulation procedure worked as expected.
Simulation of only mean differences across a pool of Ngroups
Through Equation 4 one can see that $\frac{-4{\lambda}^{2}}{4{\lambda}^{2}+4\lambda (1-2\lambda )+1}\approx 0$ for α = 2 and 0.01 ≤ λ ≤ 0.1; τ_{ xy } > 0 for α < 2 because the term −4λ(1−2λ)(α−2) > 0 and dominates the term −λ^{2}α^{2} < 0; τ_{ xy } < 0 for α > 2 because −4λ(1−2λ)(α−2) < 0 and −λ^{2}α^{2} < 0.
Through Equation 5 one can see that τ_{ xy } assumes values of nearly −1 for 10 ≤ N ≤ 100. A visualization of the problem considered in our second simulation case study is shown (Figure 2b) through a scatterplot of expression data of a gene pair xy simulated for 10 groups according to the parameters μ_{xy,i} = (i,(11−i)), ρ_{ xy } = 0.9, and n_{ i } = 50, for i = 1,10. As shown in Figure 2b, even though there is a positive trend between the expressions of gene x and gene y within each of the 10 groups, the trend between expression of the gene pair xy across the pool of all groups is negative.
Simulation of only heteroskedasticity across a pool of Ngroups
Through simulation, we analyzed the effect of variations in ${\mathrm{\Sigma}}_{\mathit{\text{xy}},i}=\left(\begin{array}{cc}{\sigma}_{x,i}^{2}& {\sigma}_{\mathit{\text{xy}},i}\\ {\sigma}_{\mathit{\text{xy}},i}& {\sigma}_{y,i}^{2}\end{array}\right)$among N groups, keeping μ_{xy,i} = μ constant. Simulation of the case where σ_{xy,i} = 0 but ${\sigma}_{x,i}^{2}\ne {\sigma}_{x,j}^{2}$ and ${\sigma}_{y,i}^{2}\ne {\sigma}_{y,j}^{2}$ for i ≠ j resulted in nearly zero correlation coefficients from the pool of N groups (−0.04 ≤ r_{ xy } ≤ 0.05). This result agrees with Equation 1, which predicts zero correlation if covariances and mean differences of a gene pair across all groups are zero, even when variances differ across groups.
where ${\stackrel{\u0304}{\sigma}}_{x}=\frac{\sum _{i=1}^{N}{\sigma}_{x,i}}{N}$ and $\stackrel{\u0304}{{\sigma}_{x}^{2}}=\frac{\sum _{i=1}^{N}{\sigma}_{x,i}^{2}}{N}$. Equation 6 also shows a linear relationship between τ_{ xy } and ρ with a slope of $\frac{\stackrel{\u0304}{{\sigma}_{x}}}{\sqrt{\stackrel{\u0304}{{\sigma}_{x}^{2}}}}=0.872$, for σ_{x,i} = i^{2}, i = 1,N, and 10 ≤ N ≤ 100.
There is also a linear relationship between the average of correlation coefficients within each group ${\stackrel{\u0304}{r}}_{\mathit{\text{xy}}}$ and ρ (Figure 3), in which $-0.889<{\stackrel{\u0304}{r}}_{\mathit{\text{xy}}}<+0.889$. In addition, as shown in Figure 3, $\left|{\stackrel{\u0304}{r}}_{\mathit{\text{xy}}}\right|>\left|{r}_{\mathit{\text{xy}}}\right|$. Hence, it can be easily inferred that the mean squared error between r_{ xy } and ρ (the true correlation within each group) is greater than the mean squared error between ${\stackrel{\u0304}{r}}_{\mathit{\text{xy}}}$ and ρ. Therefore, pools of N simulated groups marked by only heteroskedasticity provide less efficient estimates of the correlation across groups than combining the N groups’ correlation coefficients into an average.
We analyzed the mean squared error between τ_{ xy } and ${\widehat{\tau}}_{\mathit{\text{xy}}}$ versus n_{ i }, the number of simulated elements in each group, for 10 ≤ n_{ i } ≤ 100 and N = 20 (data shown in Additional file 1). τ_{ xy } was obtained from plugging population parameters μ_{xy,i} and Σ_{xy,i} into Equation 1, whereas ${\widehat{\tau}}_{\mathit{\text{xy}}}$ was based on a combination of parameters of each group ${\widehat{\mu}}_{\mathit{\text{xy}},i}$ and ${\widehat{\mathrm{\Sigma}}}_{\mathit{\text{xy}},i}$ (see Equation 7). The mean squared error ranged from 0.0004 for n_{ i } = 100 to 0.004 for n_{ i } = 10. The correspondence between τ_{ xy } and ${\widehat{\tau}}_{\mathit{\text{xy}}}$ was good even for n_{ i } = 10, a small number of simulated elements per group.
Application to experimental microarray data
Our simulation study showed that Pearson correlation coefficients obtained from a pool of data coming from groups that have different means are explained solely by mean differences across groups. Furthermore, we showed that pooling data marked by only heteroskedasticity provides less efficient estimates of correlation coefficients than does a classical meta-analysis approach of combining correlation coefficients into an average. The following analysis of experimental microarray data illustrates the results predicted by both theory and simulation.
Example data set
Description of the the example data set
TAIR ID | Abiotic stress | Cel files | Plant material | n _{ i } | ${\mathit{n}}_{i}^{\ast}$ |
---|---|---|---|---|---|
ME00325 | Cold | 48 | Root | 12 | 24 |
Shoot | 12 | 24 | |||
ME00326 | Genotoxic | 47 | Root | 12 | 22 |
Shoot | 12 | 24 | |||
ME00327 | Osmotic | 48 | Root | 12 | 24 |
Shoot | 12 | 24 | |||
ME00328 | Salt | 48 | Root | 12 | 24 |
Shoot | 12 | 24 | |||
ME00329 | UVB | 56 | Root | 14 | 28 |
Shoot | 14 | 28 | |||
ME00330 | Wound | 56 | Root | 14 | 28 |
Shoot | 14 | 28 | |||
ME00338 | Drought | 56 | Root | 14 | 28 |
Shoot | 14 | 28 | |||
ME00339 | Heat | 67 | Root | 18 | 36 |
Shoot | 16 | 30 | |||
ME00340 | Oxidative | 48 | Root | 12 | 24 |
Shoot | 12 | 24 | |||
ME00345 | Light | 48 | Seedlings | 16 | 48 |
Experimental data analysis
We imported data from cel files into the R environment [25] and processed the data with MAS5 from the open-source Bioconductor R package affy [29, 30]. Following the methodology described in Horan et al.[22], we did not screen our example data set for quality of biological replicates, and therefore no outliers were removed. We followed this procedure because the same data from the 10 experiments of AtGenExpress [27] were also part of the larger data set used in the work by Horan et al.[22]. As described in the methodology of Mentzen and Wurtele [21], all data were subsequently normalized using the median absolute deviation method as performed by the function normalizeBetweenArrays (with the option “scale”) from the open-source Bioconductor R package Limma [30, 31]. We obtained mean values of biological replicates after a log transformation (base 2) of the normalized expression data. Because the two treatment conditions “genotoxic stress applied to root 1 hour post-treatment” and “heat control applied to shoot 24 hours post-treatment” had data for only one biological replicate, their expression measurements were used as mean values (refer to Additional file 2 for more details). Thereafter, mean values of biological replicates were combined into one large expression matrix (pooled data) encompassing 254 columns and 22,810 rows (corresponding to probe ids/genes). All but two columns of the large expression matrix resulted from averaging data of two or three biological replicates (refer to Additional file 2 for exact number of biological replicates per treatment condition). Gene-to-gene Pearson product-moment correlation coefficients (r_{ x y }) were obtained from the large expression matrix (pooled data) with the R function cor. [25].
Mean differences across a pool of microarray data
We used estimates of the asymptotic expression given in Equation 1 to examine the makeup of Pearson correlation coefficients obtained from pooling the means of biological replicates of different experimental conditions into one large expression matrix. In order to accomplish this task, we classified data in columns of the large expression matrix into 19 groups. Each group had gene expression values from either root or shoot in each of nine types of abiotic stress treatments (see Table 1 for details). We adopted this procedure because an exploratory analysis showed clear mean differences in gene levels expressed in root or shoot. The light stress experiment was for entire seedlings, and our analysis did not show mean differences that would justify further division of the data from this experiment. Each group’s name and its corresponding number of elements n_{ i }, for groups i = 1,19, are given in Table 1 (number of elements n_{ i } correspond to the number of mean expression values of a gene within group i).
Data across 19 groups were obviously not homogeneous because each group corresponds to a combination of the type of abiotic stress and the plant material, which surely would have an effect on the total group mean of a gene expression. In addition, data within groups cannot strictly be considered homogeneous either because gene mean expression values within groups correspond to different time points post application of abiotic stress/control treatments (further details about treatment conditions inside and across groups is given in Additional file 2). Because our exploratory analysis indicated that the grand mean expression level of genes within groups seemed to dominate over means of all other treatment effects (data not shown), we considered data within groups as roughly homogeneous.
- 1.
For a given gene pair xy, obtain estimates${\widehat{\mu}}_{\mathit{\text{xy}},i}=({\stackrel{\u0304}{x}}_{i},{\stackrel{\u0304}{y}}_{i})$ and${\widehat{\mathrm{\Sigma}}}_{\mathit{\text{xy}},i}=\left(\begin{array}{ll}{s}_{x,i}^{2}& {s}_{\mathit{\text{xy}},i}\\ {s}_{\mathit{\text{xy}},i}& {s}_{y,i}^{2}\end{array}\right)$ for all groups i = 1,19. Here${\stackrel{\u0304}{x}}_{i}$ and${\stackrel{\u0304}{y}}_{i}$ are, respectively, group means of expression levels of genes x and y, and${s}_{x,i}^{2}$,${s}_{y,i}^{2}$, and s _{xy,i} are group variances and covariances of expression levels of genes x and y, respectively.
- 2.Estimate asymptotic coefficients${\widehat{\tau}}_{\mathit{\text{xy}}}$ as${\widehat{\tau}}_{\mathit{\text{xy}}}=\frac{{\stackrel{\u0304}{s}}_{\mathit{\text{xy}}}+{d}_{\mathit{\text{xy}}}}{{d}_{x}{d}_{y}}$(7)where${\stackrel{\u0304}{s}}_{\mathit{\text{xy}}}=\sum _{i=1}^{19}{\lambda}_{i}{s}_{\mathit{\text{xy}},i}$(8)${d}_{\mathit{\text{xy}}}=\sum _{i=1}^{19}\sum _{j>i}^{19}{\lambda}_{i}{\lambda}_{j}({\stackrel{\u0304}{x}}_{i}-{\stackrel{\u0304}{x}}_{j})({\stackrel{\u0304}{y}}_{i}-{\stackrel{\u0304}{y}}_{j})$(9)$\begin{array}{l}{d}_{x}^{2}=\stackrel{\u0304}{{s}_{x}^{2}}+\sum _{i=1}^{19}\sum _{j>i}^{19}{\lambda}_{i}{\lambda}_{j}{({\stackrel{\u0304}{x}}_{i}-{\stackrel{\u0304}{x}}_{j})}^{2}\end{array}$(10)$\begin{array}{l}{d}_{y}^{2}=\stackrel{\u0304}{{s}_{y}^{2}}+\sum _{i=1}^{19}\sum _{j>i}^{19}{\lambda}_{i}{\lambda}_{j}{({\stackrel{\u0304}{y}}_{i}-{\stackrel{\u0304}{y}}_{j})}^{2}\end{array}$(11)$\begin{array}{l}\stackrel{\u0304}{{s}_{x}^{2}}=\sum _{i=1}^{19}{\lambda}_{i}{s}_{x,i}^{2}\end{array}$(12)$\begin{array}{l}\stackrel{\u0304}{{s}_{y}^{2}}=\sum _{i=1}^{19}{\lambda}_{i}{s}_{y,i}^{2}\end{array}$(13)
- 3.
Use the residual error (${r}_{\mathit{\text{xy}}}-{\widehat{\tau}}_{\mathit{\text{xy}}}$) to compare Pearson correlation coefficients as obtained from the large expression matrix and coefficients as estimated through Equation 7, which are based on parameter estimates of 19 groups of data.
- 4.
Small residual errors indicate good agreement between r _{ xy } and${\widehat{\tau}}_{\mathit{\text{xy}}}$. As a result, the two components of Equation 7 (i.e. the weighted average of all covariances across 19 groups and the weighted average of the cross product of mean differences of a gene pair xy across 19 groups) can explain the signs and magnitudes of r _{ xy }, the Pearson correlation coefficient obtained from the large expression matrix.
Pearson correlation coefficients obtained from the pooled expression data
We first obtained Pearson correlation coefficients for pairwise combinations of all 22,810 genes present in the large expression matrix, which resulted in more than 260 million coefficients. The Pearson correlation coefficients ranged from −0.992 to 0.998 with 0.008 as the mean value, and coefficients showed a symmetric distribution around zero; roughly 10% of these coefficients were greater than 0.7 or less than −0.7 (data shown in Additional file 1).
The analysis of all 124,750 pairwise correlations of 500 randomly selected genes revealed good agreement between r_{ xy } and${\widehat{\tau}}_{\mathit{\text{xy}}}$, despite the approximations we made about homogeneity of data within groups and the relatively low number of elements in each group; in our example data set 12 ≤ n_{ i } ≤ 18, whereas Hassler and Thadewald’s example data set had around 90 elements in each of two groups [9]. Therefore, our analysis reassured us that the Pearson correlation coefficients obtained from the large expression matrix can be explained by heterogeneities due to different means and variances-covariances across the 19 groups we used to classify our example data set.
Because |r_{ xy }| > 0.7 obtained from pools of microarray data has been used as the cut-off value representing a strong association between gene pairs [21-23], we computed the percentage contribution of the covariance and mean differences terms on the magnitude of |r_{ xy }| ≥ 0.7, i.e.$\frac{{\stackrel{\u0304}{s}}_{\mathit{\text{xy}}}}{{r}_{\mathit{\text{xy}}}{d}_{x}{d}_{y}}+\frac{{d}_{\mathit{\text{xy}}}}{{r}_{\mathit{\text{xy}}}{d}_{x}{d}_{y}}\approx 1$. There were 10,567 correlation coefficients with roughly equal numbers distributed in the r_{ xy } < −0.7 and r_{ xy } > 0.7 categories. The median of$\frac{{\stackrel{\u0304}{s}}_{\mathit{\text{xy}}}}{{r}_{\mathit{\text{xy}}}{d}_{x}{d}_{y}}\%$ was 1.98% with 50% of the data showing values between 0.04% and 5.32%. Conversely, the median of$\frac{{d}_{\mathit{\text{xy}}}}{{r}_{\mathit{\text{xy}}}{d}_{x}{d}_{y}}\%$ was 96.93% with 50% of the data showing values between 93.51% and 98.98%.
A combination of correlation coefficients between expression profiles within each group, given by${\stackrel{\u0304}{r}}_{\mathit{\text{xy}}}=\sum _{i=1}^{19}{\lambda}_{i}{r}_{\mathit{\text{xy}},i}$, ranged from$-0.6<{\stackrel{\u0304}{r}}_{\mathit{\text{xy}}}<0.89$ with 0.132 as the mean value. A direct comparison between r_{ xy } and${\stackrel{\u0304}{r}}_{\mathit{\text{xy}}}$ showed a correlation coefficient of 0.3.
By applying the asymptotic theory developed by Hassler and Thadewald [9] to the Pearson correlation coefficients obtained from the large expression matrix, we showed that differences in means across 19 heterogeneous groups of data is the main factor determining the magnitude and sign of coefficients of 124,750 gene pairs. As previously shown by Hassler and Thadewald [9], this result corroborates that gene pairwise correlation coefficients estimated from a pool of microarray data do not measure “the closeness of linear relationship” [34] (p. 177) between expressions of a gene pair. Instead, they measure the extent of mean differences of a gene pair across different groups comprising the pool of data.
Heteroskedasticity across a pool of microarray data
For instance, heteroskedasticity could occur in a situation in which data from completely replicated microarray experiments are pooled to be examined as one data set. As was reported in the work of Goldstein et al.[1], data variability could differ substantially across replicated microarray experiments.
where${\widehat{\beta}}_{i}$ was estimated using the open-source Bioconductor R package Limma [30, 31]. This approach is equivalent to subtracting expression levels of each biological replicate from their mean values. We used linear models because they are well known by the community who works with differential expression of microarrays measurements and the process of obtaining their residuals is easy and automatic through the use of the Limma package [30, 31].
We combined all gene expression residuals from the 19 groups into one pool of residuals (a large matrix of residuals including 520 columns and 22,810 rows). Expression levels of the two treatment conditions “genotoxic stress applied to root 1 hour post-treatment” and “heat control applied to shoot 24 hours post-treatment” could not be used in the analysis of residuals because they had only one biological replicate (refer to Additional file 2 for more details). This explains why the matrix of residuals has 520 columns instead of 522 columns. We repeated the analysis described in steps 1-4 from the section “Application to experimental microarray data” for the data in the large matrix of residuals.
Pearson correlation coefficients estimated from the pooled residuals
Here we show results of the analysis involving the large matrix of residuals (pooled residuals) for the same subset of 500 genes used in the analysis of the large expression matrix. Pearson correlation coefficients (${r}_{\mathit{\text{xy}}}^{\ast}$) of all 124, 750 pairwise combinations of 500 genes obtained from the large matrix of residuals ranged from −0.553 to 0.849 with 0.01 as the mean value. Their asymptotic counterparts (${\widehat{\tau}}_{\mathit{\text{xy}}}^{\ast}$), estimated according to Equation 7, ranged from −0.554 to 0.849 with 0.01 as the mean value. The combination of covariances within each of the 19 groups, i.e.$\sum _{i=1}^{19}{\lambda}_{i}{s}_{\mathit{\text{xy}},i}$, determined the sign of${r}_{\mathit{\text{xy}}}^{\ast}$ because all pairwise mean differences among groups were zero (data shown in Additional file 1).
We then compared${r}_{\mathit{\text{xy}}}^{\ast}\approx \frac{\sum _{i=1}^{19}{\lambda}_{i}{s}_{\mathit{\text{xy}},i}}{\sqrt{\sum _{i=1}^{19}{\lambda}_{i}{s}_{x,i}^{2}\sum _{i=1}^{19}{\lambda}_{i}{s}_{y,i}^{2}}}$ to the weighted average of correlations obtained within each of the 19 groups of residuals, i.e.${\stackrel{\u0304}{r}}_{\mathit{\text{xy}}}^{\ast}=\sum _{i=1}^{19}{\lambda}_{i}{r}_{\mathit{\text{xy}},i}^{\ast}$;${\stackrel{\u0304}{r}}_{\mathit{\text{xy}}}^{\ast}$ ranged from −0.631 to 0.847 with 0.011 as the mean value. In addition, we observed a strong linear relationship between${r}_{\mathit{\text{xy}}}^{\ast}$ estimated from the large matrix of residuals and${\stackrel{\u0304}{r}}_{\mathit{\text{xy}}}^{\ast}$, with a correlation equal to 0.93. Therefore, the Pearson correlation coefficients obtained from the large matrix of residuals (whose heterogeneities result from different variances-covariances across the 19 groups) also measure a linear relationship between the expression residuals of a gene pair.
Bias of correlation coefficients obtained across 19 groups of microarray data
where${\widehat{\rho}}_{\mathit{\text{xy}}}$ represents the correlation point estimate of a gene pair xy across the 19 groups of microarray data and${\widehat{\rho}}_{\mathit{\text{xy}},i}$ represents its counterparts within each group.
Statistical summaries of biases of correlation coefficients
Min. | 1st Qu. | Median | 3rdQu. | Max. | |
---|---|---|---|---|---|
B(r_{ xy }) | 0.022 | 0.078 | 0.099 | 0.132 | 0.308 |
$B\left({\stackrel{\u0304}{r}}_{\mathit{\text{xy}}}\right)$ | 0.021 | 0.062 | 0.072 | 0.083 | 0.153 |
$B\left({r}_{\mathit{\text{xy}}}^{\ast}\right)$ | 0.019 | 0.049 | 0.057 | 0.067 | 0.141 |
$B\left({\stackrel{\u0304}{r}}_{\mathit{\text{xy}}}^{\ast}\right)$ | 0.018 | 0.049 | 0.057 | 0.065 | 0.132 |
The analysis involving the data in the large expression matrix (whose heterogeneities were due to means and variances-covariances differences across 19 groups) resulted in consistently larger statistical summaries of B(r_{ xy }) than did those of$B\left({\stackrel{\u0304}{r}}_{\mathit{\text{xy}}}\right)$. In addition, the maximum value of B(r_{ xy }) is twice as much the maximum value of$B\left({\stackrel{\u0304}{r}}_{\mathit{\text{xy}}}\right)$ (Table 2). For the large matrix of residuals (whose heterogeneities were due only to heteroskedasticity), the values of$B\left({r}_{\mathit{\text{xy}}}^{\ast}\right)$ shown in Table 2 are slightly larger than are the values of$B\left({\stackrel{\u0304}{r}}_{\mathit{\text{xy}}}^{\ast}\right)$.
Discussion and conclusion
Discussion
In this study, we performed a comprehensive analysis of Pearson correlation coefficients obtained from combining data of 19 heterogeneous groups of experimental microarray data into one large pool. By applying the theory developed by Hassler and Thadewald [9] to our example data set, we determined the specific effect of mean differences and heteroskedasticity across the 19 groups on the sign and magnitude of the pooled coefficients. In addition, we provided a performance metric for correlation coefficients by quantifying their biases.
We quantified the bias of the correlation coefficient of a gene pair through the mean squared error between its estimate across a pool of groups and its estimates within groups. A similar method has been used by Hunter and Schmidt to assess the variance of their meta-analysis estimator of the Pearson correlation coefficient across independent studies [36]. We evaluated the bias of gene-to-gene correlations estimated according to the following two methods: (a) by combining 19 groups of microarray data into a large pool to be analyzed as a single data set (pooled data) and (b) by combining correlation coefficients of each of 19 groups of microarray data into an average weighted by the number of elements in each group, which corresponds to the Hunter-Schmidt meta-analysis estimator of the Pearson correlation coefficient across independent studies [36].
The data used in this study came from 10 microarray experiments (AtGenExpress Project [27]) carried out by seven different laboratories distributed across Germany that followed the same experimental protocol; these are a subset of the large pool of microarray data found in the study of Horan et al.[22]. Experiments followed a 3-factorial design with treatment (abiotic stress, control), tissue (root, shoot, or seedlings in general), and time post-treatment as factors [27]. Mean differences within and across experiments were a matter-of-fact because statistically significant differences in gene expression of several types of abiotic stress versus control treatment were reported in Kilian et al.’s study [27]. We expected differences due to variability across experiments to remain after removing mean differences because of reported difficulties in the reproduction of microarray studies [1]. Therefore, homogeneity cannot be ensured across experiments, and combining the means or residuals of biological replicates of the 10 experiments into a large pool as a single set is not sound from a statistical viewpoint.
The analysis of the components of the correlation coefficients obtained from the large expression matrix corroborated the results predicted by both theory and simulation that variances-covariances within the 19 groups had negligible impact on correlation coefficients, but different means across the 19 groups had a decisive effect on the sign as well as on the magnitude of coefficients. Coefficients that were greater than 0.7 or less than −0.7 showed the largest range of bias (Table 2). Therefore, large values of the pooled coefficients were an artifact in the sense that they did not communicate a real linear association between the expression profiles of two genes; rather, they appeared because the data were combined into a large pool. For this reason, large values of the pooled coefficients are in fact an ecological fallacy [10].
We also showed through Monte Carlo simulation that the structure of different means across a pool of 10 ≤ N ≤ 100 groups could generate Simpson’s paradox. In our case study simulation shown in Figure 2, we showed that even though the correlation within each group was +0.9, a pool of N (10 ≤ N ≤ 100) groups provided negative correlation coefficients because the combination of all pairwise mean differences had a negative sign and greater magnitude than the positive covariance of the data within groups. Hassler and Thadewald [9] studied Simpson’s paradox through the analytical solution of Equation 1 for N = 2, and showed that the occurrence of mean differences with opposite signs in both correlated variables is a condition for contradictory results between a correlation coefficient that is estimated across or within each of two groups.
We combined residuals from fitting linear models of every gene into a large matrix of residuals (22,810 rows x 520 columns). Here we departed from the assumption of independence (common to the analysis of differentially expressed genes [35, 37]) and considered a multivariate normal distribution for residuals within groups, with a mean of zero and variance-covariance Σ_{xy,i}, i = 1,19. The large matrix of residuals gave us the opportunity to evaluate gene pair correlations estimated from a pool of data marked by only heteroskedasticity. Our results showed that correlation coefficients estimated across the 19 groups of residuals were closely related to the variance-covariances within groups. We also found a strong linear relationship between the Pearson correlation coefficients obtained from the large matrix of residuals and the coefficients resulting from averaging correlation estimates within groups. However, the heteroskedasticity of the data in the large matrix of residuals resulted in less efficient estimations of the correlation between a gene pair than did the classical meta-analysis approach of combining correlation coefficients into an average. These results were corroborated by Monte Carlo simulations of only heteroskedasticity across N > 2 groups of data.
The results shown in this study indicate that the combination of statistical results is best suited for estimating correlations of a gene pair across several microarray studies. Nevertheless, further studies are necessary to assess various methods of combining within-groups gene-to-gene correlation coefficients.
Conclusion
- (A)
Large values of gene-to-gene Pearson correlation coefficients estimated from a pool of 19 groups of microarray data were an ecological fallacy; the effect of heterogeneous means across a pool of data overpowers the covariance structure of the data.
- (B)
The effect of heterogeneous variance-covariances across a pool of data causes less efficient estimates of Pearson correlation coefficients across groups of microarray data than does the approach of combining correlation coefficients of individual groups.
- (C)
The combination of statistical results is best suited for synthesizing the correlation between expression profiles of a gene pair across several microarray studies.
Declarations
Acknowledgements
Many thanks to Ron Mowers for helpful discussions and support. Many thanks to Di Cook for suggestions on the statistical analysis, design of the simulation experiment, and data visualization with the R package ggplot2. We thank two anonymous reviewers for their insightful comments.
Authors’ Affiliations
References
- Goldstein DR, Delorenzi M, Luthi-Carter R, Sengstag T: Comparison of meta-analysis to combined analysis of a replicated microarray study. Meta-Analysis and Combining Information in Genetics and Genomics, Volume 1. Edited by: Guerra R, Goldstein DR. 2010, Boca Raton: Chapman and Hall, 135-156.Google Scholar
- Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, Frank BC, Gabrielson E, Garcia JGN, Geoghegan J, Germino G, Griffin C, Hilmer SC, Hoffman E, Jedlicka AE, Kawasaki E, Martínez-Murillo F, Morsberger L, Lee H, Petersen D, Quackenbush J, Scott A, Wilson M, Yang Y, Qing SY, Yu W: Multiple-laboratory comparison of microarray platforms. Nat Methods. 2005, 2: 345-349. 10.1038/nmeth756.View ArticlePubMedGoogle Scholar
- Goldstein DR, Guerra R: A brief introduction to meta-analysis, genetics and genomics. Meta-Analysis and Combining Information in Genetics and Genomics, Volume 1. Edited by: Guerra R. 2010, Boca Raton: Chapman and Hall, 3-20.Google Scholar
- Hedges LV, Olkin I: Statistical Methods for Meta-Analysis. 1985, Orlando: Academic PressGoogle Scholar
- De Veaux RD, Hand DJ: How to lie with bad data. Stat Sci. 2005, 20: 231-238. 10.1214/088342305000000269.View ArticleGoogle Scholar
- Simpson EH: The interpretation of interaction in contingency tables. J Roy Stat Soc Ser B. 1951, 13: 238-241.Google Scholar
- Ooi YH: Simpson’s paradox - a survey of past, present and future research. Wharton Research Scholars Journal. Edited by: Scholarly Commons, Scholarly Commons . 2004, University of PennsylvaniaGoogle Scholar
- Blyth CR: On Simpson’s paradox and the sure thing principle. JASA. 1972, 67: 364-366. 10.1080/01621459.1972.10482387.View ArticleGoogle Scholar
- Hassler U, Thadewald T: Nonsensical and biased correlation due to pooling heterogeneous samples. Statistician. 2003, 52: 367-379.Google Scholar
- Cressie NAC: Statistics for Spatial Data. 1993, New York: John Wiley & SonsGoogle Scholar
- Gehlke CE, Biehl K: Certain effects of grouping upon the size of the correlation coefficient in census tract material. JASA. 1934, 29: 169-170. [http://www.jstor.org/stable/2277827]Google Scholar
- Openshaw S, Taylor PJ: A million or so correlation coefficients: three experiments on the modifiable areal unit problem. Statistical Applications in the Spatial Sciences. Edited by: Wrigley N. 1979, London: Pion Limited, 127-144.Google Scholar
- Brockwell SE, Gordon IR: A comparison of statistical methods for meta-analysis. Stat Med. 2001, 20: 825-840. 10.1002/sim.650.View ArticlePubMedGoogle Scholar
- Parmigiani G, Garrett-Mayer ES, Anbazhagan R, Gabrielson E, S E: A cross-study comparison of gene expression studies for the molecular classification of lung cancer. Clin Cancer Res. 2004, 10: 2922-2927. 10.1158/1078-0432.CCR-03-0490.View ArticlePubMedGoogle Scholar
- Wirapati P, Sotiriou C, Kunkel S, Farmer P, Pradervand S, Haibe-Kains B, Desmedt C, Ignatiadis M, Sengstag T, Schütz F, Goldstein DR, Piccart M, Delorenzi M: Meta-analysis of gene expression profiles in breast cancer: towards a unified understanding of breast cancer subtyping and prognosis signatures. Breast Cancer Res. 2008, 10: R65-10.1186/bcr2124.PubMed CentralView ArticlePubMedGoogle Scholar
- Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM: Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. PNAS. 2004, 101: 9309-9314. 10.1073/pnas.0401994101.PubMed CentralView ArticlePubMedGoogle Scholar
- Hu P, Greenwood CMT, Beyene J: Integrative analysis of multiple gene expression profiles with quality-adjusted effect size models. BMC Bioinformatics. 2005, 6: 128-10.1186/1471-2105-6-128.PubMed CentralView ArticlePubMedGoogle Scholar
- Borozan I, Chen L, Paeper B, Heathcote JE, Edwards AM, Katze M, Zhang Z, McGilvray ID: MAID : An effect size based model for microarray data integration across laboratories and platforms. BMC Bioinformatics. 2008, 9: 305-10.1186/1471-2105-9-305.PubMed CentralView ArticlePubMedGoogle Scholar
- Kim S, Webster MJ: Correlation analysis between genome-wide expression profiles and cytoarchitectural abnormalities in the prefrontal cortex of psychiatric disorders. Mol Psychiatr. 2010, 15: 326-336. 10.1038/mp.2008.99.View ArticleGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. Nat Genet. 2000, 25: 25-29. 10.1038/75556.PubMed CentralView ArticlePubMedGoogle Scholar
- Mentzen WI, Wurtele ES: Regulon organization of Arabidopsis. BMC Plant Biol. 2008, 8: 99-10.1186/1471-2229-8-99.PubMed CentralView ArticlePubMedGoogle Scholar
- Horan K, Jang C, Bailey-Serres J, Mittler R, Shelton C, Harper JF, Zhu JK, Cushman JC, Gollery M, Girke T: Annotating genes of known and unknown function by large-scale coexpression analysis. Plant Physiol. 2008, 147: 41-57. 10.1104/pp.108.117366.PubMed CentralView ArticlePubMedGoogle Scholar
- Feng YP, Hurst J, Almeida-de Macedo M, Chen X, Li L, Ransom N, Wurtele ES: A massive human co-expression-network and its medical applications. Summit Syst Biol, Chem Biodivers. in pressGoogle Scholar
- Ngaki MN, Louie GV, Philippe RN, Manning G, Pojer F, Bowman ME, Li L, Larsen E, Wurtele ES, Noel JP: Evolution of the chalcone-isomerase fold from fatty-acid binding to stereospecific catalysis. Nature. 2012Google Scholar
- The R project for statistical computing. [http://www.r-project.org/]
- Venables WN, Ripley BD: Modern Applied Statistics with S, fourth edition. 2002, New York: Springer, [http://www.stats.ox.ac.uk/pub/MASS4] [ISBN 0-387-95457-0]View ArticleGoogle Scholar
- Kilian J, Whitehead D, Horak J, Wanke D, Weinl S, Batistic O, D’Angelo C, Bornberg-Bauer E, Kudla J, Harter K: The AtGenExpress global stress expression data set: protocols, evaluation and model data analysis of UV-B light, drought and cold stress responses. Plant J. 2007, 50: 347-363. 10.1111/j.1365-313X.2007.03052.x.View ArticlePubMedGoogle Scholar
- The Arabidopsis Information Resource. [http://www.arabidopsis.org/]
- Gautier L, Cope L, Bolstad BM, Irizarry RA: affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics. 2004, 20 (3): 307-315. 10.1093/bioinformatics/btg405.View ArticlePubMedGoogle Scholar
- Bioconductor - open source software for bioinformatics. [http://www.bioconductor.org/]
- Smyth GK: Limma: linear models for microarray data. Bioinformatics and Computational Biology Solutions using R and Bioconductor. Edited by: Gentleman R, Carey V, Dudoit S, Irizarry R, Huber W. 2005, New York: Springer, 397-420.View ArticleGoogle Scholar
- Soper HE: On the probable error of the correlation coefficient to a second approximation. Biometrika. 1913, 9: 91-115.View ArticleGoogle Scholar
- Fisher RA: On the probable error of the correlation coefficient to a second approximation. Biometrika. 1915, 10: 507-521.Google Scholar
- Snedecor GW, Cochran WG: Statistical Methods. 1989, Ames: Iowa State University PressGoogle Scholar
- Smyth GK: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mo B. 2004, 3 (1):Google Scholar
- Field AP: Meta-analysis of correlation coefficients: a Monte-Carlo comparison of fixed- and random-effects methods. Psychol Methods. 2001, 6: 161-180.View ArticlePubMedGoogle Scholar
- Smyth GK, Yang YH, Speed TP: Statistical issues in microarray data analysis. Method Mol Biol. 2003, 224: 111-136.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.