MAID : An effect size based model for microarray data integration across laboratories and platforms
© Borozan et al. 2008
Received: 27 February 2008
Accepted: 10 July 2008
Published: 10 July 2008
Skip to main content
© Borozan et al. 2008
Received: 27 February 2008
Accepted: 10 July 2008
Published: 10 July 2008
Gene expression profiling has the potential to unravel molecular mechanisms behind gene regulation and identify gene targets for therapeutic interventions. As microarray technology matures, the number of microarray studies has increased, resulting in many different datasets available for any given disease. The increase in sensitivity and reliability of measurements of gene expression changes can be improved through a systematic integration of different microarray datasets that address the same or similar biological questions.
Traditional effect size models can not be used to integrate array data that directly compare treatment to control samples expressed as log ratios of gene expressions. Here we extend the traditional effect size model to integrate as many array datasets as possible. The extended effect size model (MAID) can integrate any array datatype generated with either single or two channel arrays using either direct or indirect designs across different laboratories and platforms. The model uses two standardized indices, the standard effect size score for experiments with two groups of data, and a new standardized index that measures the difference in gene expression between treatment and control groups for one sample data with replicate arrays. The statistical significance of treatment effect across studies for each gene is determined by appropriate permutation methods depending on the type of data integrated. We apply our method to three different expression datasets from two different laboratories generated using three different array platforms and two different experimental designs. Our results indicate that the proposed integration model produces an increase in statistical power for identifying differentially expressed genes when integrating data across experiments and when compared to other integration models. We also show that genes found to be significant using our data integration method are of direct biological relevance to the three experiments integrated.
High-throughput genomics data provide a rich and complex source of information that could play a key role in deciphering intricate molecular networks behind disease. Here we propose an extension of the traditional effect size model to allow the integration of as many array experiments as possible with the aim of increasing the statistical power for identifying differentially expressed genes.
Microarray technology is becoming an important tool for biological research and clinical diagnostics , but it has the reputation of being noisy: studies addressing the reproducibility and reliability of microarray data across different laboratories and platforms have resulted in inconsistent results. Some have found agreement between experiments [2–7] while others have not [8–11]. A study by Irizarry et al.  on microarray data reproducibility has demonstrated that disagreement observed in some of the studies may be also due to questionable statistical analysis. There is general agreement that the variability inherent to DNA microarray technology is due to the following factors. There are a number of microarray platforms independently developed by industry and academia. Different protocols are used by different laboratories for RNA preparation and labeling. Different statistical and computational tools are used in the analysis of the microarray results. Due to these differences it is challenging to extract reproducible, biologically meaningful information from different DNA microarray experiments that address the same, or very similar biological questions. One possible solution to extract this information is to use meta-analysis methods that integrate the results of separate studies in a statistically meaningful manner. There are two main types of meta-analysis that are commonly used for microarray data integration. The first consists of integrating summary measures of gene expression measurements across studies. The advantage of this type of approach is that it avoids the need for estimating the inter-study variability and the issue of cross-platform normalization. Rhodes et al.  were the first to implement this type of approach. This group implemented a statistical model based on integrating p-values from individual studies to estimate the overall p-value for each gene across studies. The authors integrated four published prostate cancer gene array studies. Many of the genes identified were confirmed to be components of biologically relevant pathways, implying that the method extracted biologically useful information. Subsequently Parmigiani et al.  proposed a different model that uses a correlation-based method to search for consistent gene expression patterns across multiple studies. They demonstrated that their method can improve correlation of gene expressions across studies. Rather than combining p-values or correlations the second type of meta-analysis consists of integrating gene expression measures across studies. Choi et al.  were the first to propose this type of approach using an effect size measure  with a method that explicitly models the inter-study variability. Using the same datasets as those used in , they demonstrated that their method led to increased sensitivity and reliability. Subsequently Hu et al.  extended this model by incorporating a quality measure for each gene in each study into the effect size estimates. Using their model the authors combined two lung cancer Affymetrix datasets generated from two different laboratories and found that their method identifies more differentially expressed genes than previous methods. Taken together these studies suggest that a subset of biologically plausible and statistically significant genes can be determined from the integration of different array technologies. With an ever-increasing amount of microarray data being produced it is critical to develop statistically sound methods that will efficiently integrate, evaluate and cross-validate as many array experiments addressing the same biological question as possible. Even though progress has been made in integrating various array datasets, challenges still remain, one of which is that all the existing methods require experiments with two separate groups of data.
A two channel microarray technology continues to be used as one of the most common platforms for gene expression profiling [18, 19]. One experimental approach using two channel arrays is to directly compare levels of mRNA expression between treatment and control samples (also known as direct experimental design). Such experiments lead to datasets with only one group of gene expression ratios. The method proposed in  can not be applied to such datasets since it requires two groups of data. In order to allow the integration of as many datasets as possible, including experiments with one group of data, we extend the model proposed in  and propose a new mathematical framework for integrating microarray experiments with one group, two groups of data or mixed groups.
The model proposed in our study is more general than the model proposed in , and allows the integration of microarray data of any type generated across different laboratories, platforms and experimental designs. As such, it provides more flexibility for microarray data integration than the previously published effect size based model. The model provides also a new mathematical framework for addressing the inter-study variation for microarray data of different types.
In order to assess the usefulness of our model to integrate real data we applied our method to three different expression datasets generated from two different laboratories using three different 2-channel array platforms and two different experimental designs. All three datasets compared normal liver tissue to liver tissue chronically infected with hepatitis C virus (HCV).
In order to further assess the advantage of data integration, we decided to examine whether genes found in our analysis had direct biological relevance. Genes that are determined to have statistically significant expression level changes may still have low fold increases (or decreases) that might not be biologically relevant. Although there is no consensus in fold increase/decrease associated with 'biological relevance', we chose a fold change of at least 1.5 (increased or decreased) between HCV and normal in at least one of the three integrated studies based on the estimate of the median standard deviation (median sd = +/- 0.23) of fold gene expression measurements in the three experiments (a 1.5 fold cuto3 on gene expression levels is 2 standard deviations away from the mean of genes with no expression (i.e fold = 1), and thus is less likely to be confounded with non-expressed genes). We found a total of 206 genes to satisfy those criteria. Of the 206 genes, 79 genes were integration-driven discovery (IDD) genes as defined in . We have used a 1.5 fold cutoff in our previous array studies using clinical samples and have determined a number of biologically-relevant effects (Chen et al. , Borozan et al. , Chen et al. ), and were able to validate 85 % of genes expressed at the 1.5 fold level, using quantitative real time – PCR (for more detail about gene validation we refer the reader to [30, 31]).
Significantly over-represented GO biological processes
GO over-represented categories
organismal physiological process
response to biotic stimulus
response to stimulus
regulation of caspase activity
response to pest, pathogen or parasite
response to other organism
Significantly over-represented KEGG pathways
KEGG over-represented pathways
Antigen processing and presentation
Type I diabetes mellitus
Toll-like receptor signaling pathway
Linoleic acid metabolism
In this section we compare results from data integrated with MAID to results integrated with the other methods mentioned in the Background section. Among four proposed methods for microarray data integration [13–15, 17], only two methods based on combining summary measures (Rhodes et al.  and Parmigiani et al. ) can be applied to datasets with both one and two groups of data (a single group of data with two-color array technology can be produced using a direct design approach where disease and control samples are co-hybridized on the same array). The second index introduced in our model allows the general framework proposed in  to be extended and applied to datatypes that previously could not be integrated with this method.
The number of significant genes selected by each of the three integartion models with E(F P) = 10
# of genes
# of genes
# of genes
(Rhodes et al.)
(Parmigiani et al.)
E(FP) = 10
E(FP) = 10 and |fold| ≥ 1.5
Top five over-represented GO biological process categories obtained with three different integartion models
GO (BP) categories
# of genes
organismal physiological process
response to biotic stimulus
response to stimulus
Rhodes et al.
response to biotic stimulus
response to pest, pathogen or parasite
response to other organism
Parmigiani et al.
In Figure 6 we show the plot of the number of genes selected by each individual model versus the expected number of false positives. We found that the MAID model selects more genes when compared to the other two models for the same expected rate of false positives (i.e E(F P) = 10, see also Figure 6b). For the purpose of clarity Figure 6b shows the same plot as Figure 6a but limited to gene lists with the expected number of false positives E(F P) ≤ 21. The number of significant genes selected by each model is given in Table 3.
In order to asses whether the larger gene set selected by the MAID model (for the same expected false positive rate E(F P) = 10) enriches relevant biological categories we compared the enriched GO biological process (BP) categories obtained from gene lists selected by each individual model. We also imposed a threshold on selected genes' fold changes by requiring genes to be up (or down) regulated by |fold| ≥ 1.5 (for the reasons noted earlier) in HCV samples when compared to Normals (see Table 3). In Table 4 we give the top 5 enriched GO categories from each model.
As shown in Table 4 enriched GO (BP) categories obtained with a correlation-based method proposed by Parmigiani et al.  are less significant than categories obtained from either MAID or the model proposed by Rhodes et al. , and contain many fewer genes (no more than two per category) that show no clear relevance to the HCV disease.
Of the top five significantly enriched GO (BP) categories obtained with gene sets selected by the model proposed by Rhodes et al.  and MAID, two can clearly be associated with HCV disease; these are "immune response" and "defense response" (see Table 4). Table 4 shows that the enrichment in genes selected by the MAID model is higher for both of these categories "immune response" p-value = 4.96e-6 (MAID) vs p-value = 1.84e-5 (model from Rhodes et al. ) and "defense response" p-value = 1.81e-5 (MAID) vs p-value = 3.54e-5 (model from Rhodes et al. ). These results indicate that when gene sets selected by the model from Rhodes et al.  are compared by those selected by MAID, the larger MAID gene set improves the enrichment significance of the two of the most significant and HCV relevant GO categories and points to an increase in statistical power when compared to the model proposed by Rhodes et al. .
In this study we introduce a new effect size based model for microarray data integration. We demonstrate that our model, together with appropriate data pre-processing methods, can be used to integrate expression data across different laboratories, array platforms and experimental designs that results in an increase in statistical power for identifying differentially expressed genes when integrating data across experiments. Moreover, we show that genes selected as significant by our model enrich relevant biological pathways and processes.
In order to obtain the best possible results with our model, a number of important problems relating to each individual data set had to be addressed. First, it is only reasonable to integrate experiments that aim to address the same or similar biological questions. In order to address the problem of matching of samples and experiments, we integrated only experiments that compared samples of same biological type. Second, because most of the disagreement between individual array experiments was found to be due to platform-dependent probe effects , we decided to use only relative gene expression ratios instead of absolute measurements. Third, in order to ensure better agreement between gene annotations across platforms, we focused only on genes that had identical annotation entries in the NCBI Entrez Gene database.
After addressing the problem of matching of probes, samples and experimental conditions we used exploratory analysis methods proposed in  to determine if data from the three experiments presented any important systematic bias that would preclude their integration. We found all three datasets to show low correlation coefficients between their effect sizes – though a slightly higher correlation coefficient was found for datasets from the Washington group (see Figure 1). However, inspection of individual effect size distributions showed no fundamental differences between the three datasets (see Figure 2). Low correlations of effect sizes could result from a small group of genes showing similar effects across the three experiments. When expression measurements were integrated using the above methodology, we found 451 genes to be significantly expressed across all three studies with a false discovery rate (FDR) ≤ 0.05. Of these 237 had higher statistical significance in the integrated study than in any individual study. Of 79 integration driven discovery genes found with absolute fold expression greater than 1.5, 57 were shown to be up-regulated (or down-regulated) by at least 1.5 fold in only one of the three studies. This result suggests that the magnitude of fold increase (or decrease) in each individual experiment is a poor indicator of the overall gene activity when comparing across experiments and that a more suitable metric such as effect size needs to be used. Furthermore, of the 206 genes that were found to be significant (with fold ≥ 1.5) in our analysis, 11 were found not to be significant in any of the individual studies. The potential involvement in HCV disease of these genes identified through meta-analysis alone will require further biological study. Of four previously published methods proposed for microarray data integration [13–15, 17], two methods [13, 14], based on combining summary measures, can be applied to datasets generated with mixed groups (i.e with two groups and a single group of data). Comparing results obtained with MAID to results obtained with the models proposed by Rhodes et al.  and Parmigiani et al. , we found that MAID selects more genes than any of the summary statistics based methods, and that additional genes selected by MAID are relevant to the HCV disease. Genes selected by MAID produce an increase in enrichment of relevant HCV GO categories when compared to results obtained with the two summary statistics methods (see Table 4). These findings argue that MAID produces less conservative results that are also biologically more relevant, indicating an increase in statistical power.
The overlap in results of the top genes selected by each method (for exactly the same number of expected false positives) indicates that models based on integrating p-values  and effect sizes (i.e MAID), across experiments, give more similar results than the model based on integrating gene correlations .
Models based on summary statistics that integrate p values  or expression correlations  across studies can be used to obtain more precise estimates of significance of gene expressions than those obtained from the individual array studies (see for example ). However such approaches do not take into account the inter-study variability and can produce results that are significant even for genes that have significant fold changes but that are observed to be expressed in opposite directions (increased versus decreased) across studies. Models that do take the inter-study variability into account, such as Choi et al.  and MAID, would not consider such changes as significant (for example data integration using the model proposed by Rhodes et al.  leads to 19 genes that are significant but for which the fold increase/decrease is directed in opposite directions by at least 1.5 fold in at least one of the studies). In addition to ignoring the direction of change in gene expressions across studies, summary-statistics based models do not take the magnitude of observed effects (i.e fold changes) into account either. In this way significant statistical changes (or small p values) might not necessarily correspond to important biological effect (i.e fold changes) and could inflate the number of false positives. Effect size based models instead, integrate data directly by taking into account the magnitude of the effect and its consistency both within and across studies. Moreover it has been shown that models based on integrating summary statics are less sensitive to small but consistent expression changes than an effect size based model (see Choi et al. ).
Though we agree in principle with the approach proposed in Choi et al. , we note that the model assumes that a fixed or random effect model should be fitted for all the genes. However, this approach might not always be appropriate. As pointed out in , it is more likely that for some genes there would be no effect observed, while for others a fixed or random effect model would be more appropriate. A more flexible approach should improve the sensitivity and reliability of this model. Furthermore, as noted in , for microarray data and biological systems in general, genes can not always be assumed to act independently, but often show dependency through interactions and correlations. Without a better understanding of gene-gene interaction structures, it is difficult to realize how such improvements could be included in the model. We also note that particular care needs to be taken when integrating many small-sized microarray studies with this model as the estimated between study variability τ 2 will be biased and would influence overall results [20, 28].
The approach proposed in our study differs from that of  and the GeneMeta algorithm  in several important aspects. The set of methods proposed in , as implemented in the GeneMeta  algorithm, can only be applied to experiments with two separate groups of data and thus can not be applied to two-channel microarray experiments measuring differences in gene expression values between treatment and control groups using a direct experimental design. In order to integrate as many microarray datasets from the public domain as possible we proposed a new integration method which we implemented in form of the R package MAID (we have made every effort possible to provide an R package with an easy to understand, high-quality documentation for non-expert R users, the package is available upon request from the corresponding authors and will be submitted to the Bioconductor  project to ease access and dissemination).
In MAID the type of analysis applied depends on the type of data analyzed. Thus for microarray experiments with two groups of data we use the standard effect size model proposed in . For microarray experiments with one group of data we propose a second standardized index based on the paired t-statistic (see eq.6 in Methods section) which follows a Student's t-distribution times , with (n - 1) degrees of freedom (where n is the number of microarray replicates).
In addition to eq.6 (see Methods section) we also propose new estimators for both the pooled standard deviation (which is now given in eq.7 and which replaces the pooled standard deviation given in eqs.2–3 in the Choi et al. model) and the estimated variance (which is now given in eq.8 and which replaces the estimated variance for the unbiased effect size given in eq.5 in the Choi et al.  model).
Although we adapt the same general hierarchical model framework as described in Choi et al. , a major difference is that for direct design experiments the inter-study variability given in eq.12 (first proposed by DerSimonian and Laird ) is calculated using new expressions for the pooled standard deviation and the estimated variance given in eq.7 and eq.8, instead of the expression given by Choi et al. in eq.3 and eq.5 (see Methods section).
The same changes occur in eqs.9–15 with new estimators replacing those described in Choi et al . Depending on the type of datasets integrated the homogeneity test is calculated using either one or both types of standardized indices and their respective variances. MAID implements a permutation method that is specific for each data type, experiments with two groups of data are considered as a two class label case, while experiments with one group of data are considered as a one class label case. In addition to the permutation method for a two class label case, MAID implements a second permutation method (a feature which did not exist in the model proposed by Choi et al.) for a single class label case necessary in the calculation of false discovery rate (FDR) (see eq.16 in Methods section). Without the proposed new estimators given in eqs.6–8 (see Methods section) and their implementation through eqs.10–16 (see Methods section) it would not have been possible to integrate array experiments with both direct and indirect designs using a more sophisticated model, such as the one proposed in this study that takes both the intra and inter-study variability into account.
Traditional effect size models  are limited to integration of array datasets with two groups of data. Here we extend the traditional effect size model in order to increase the sample size by allowing the integration of array experiments of any type. Using our model we have shown that it is possible to detect small but consistent changes in gene expression across these three biologically similar but independent studies. Genes with weak signals in each individual experiment can be seen as potential false negatives. We have shown that the number of false negatives can be decreased effectively by using our model. We have also demonstrated that a sizable number of genes could be cross-validated through inter-study comparison indicating that these studies show a certain degree of reproducibility. Our results also indicate that technical and biological variability present in datasets obtained from different laboratories, different platforms and designs can be overcome by appropriate data pre-processing and meta-analysis methods. By comparing our model to other integration methods available, we show that our model selects more genes (for the same number of expected false positives) that are of direct biological relevance to the experiments under consideration.
Finally we have shown that most of the genes found to improve in significance after data integration with our model are of direct biological relevance to the three experiments. High-throughput proteomics and genomics data provide a rich and complex source of information which may help to decipher the complex molecular networks behind disease. Beyond the analysis of the gene expression data presented in this study, our model provides a way of integrating multiple microarray datasets across a broad range of cross-platform studies, and allows a more general and flexible framework for microarray data integration.
Three microarray expression datasets from two laboratories were collected. Two datasets were obtained from the University of Washington. These datasets were collected using two different versions of Agilent array technology. One dataset was generated using two-channel Agilent Human 1 cDNA array platform containing 12,814 probes. This study used a direct design and included 13 chronic HCV samples co-hybridized with 13 normal samples. The second dataset was generated using two-channel Agilent Human 1A (V2) 60-mer oligonucleotide array platform containing 20,173 probes. This study used direct design and included 5 chronic HCV samples co-hybridized with 5 normal samples. The third dataset was obtained from the University of Toronto UHN Microarray Liver Disease Project [30, 31]. This dataset was generated using two-color UHN cDNA microarray slides containing 19000 probes. This study used indirect design and included 40 chronic HCV samples co-hybridized to reference RNA and 20 normal samples co-hybridized to the same reference RNA. In total 78 samples were collected across the three studies. All arrays from University of Washington group were normalized using the Rosetta Resolver error model  while all arrays from University of Toronto were normalized using an R-based, intensity-dependent LOWESS scatter plot smoother (see the Methods section of ). Prior to data-integration all expression data were log2 transformed.
where p j is the gene specific p-value for the j th study. The summary S statistics is then compared to an empirical distribution, obtained by computing summary statistics S from 100000 random permutations of p-values from each study. The meta-analysis p-value are computed as the proportion of random S statistics larger than the actual S statistics. To estimate the false discovery rate we used the R package qvalue [21, 33] with the λ parameter set to zero that produces an estimate of FDR according to the methodology proposed by Benjamini and Hochberg .
and n t and n c are individual sample sizes for treatment and control groups.
where S 1 and S 2 are standard deviations of treatment and control groups.
where n = n t + n c .
where is the mean difference between treatments and control for one sample data, is the sample variance and t paired is the Student t quantile with (n - 1) degrees of freedom.
where ν designate the number of degrees of freedom (ν = n - 1). The null hypothesis H0 tested by t paired is thus that of no differences between treatments and control for one sample data (i.e H0:μ = 0). We note that for studies with direct design n t and n c denote the number of co-hybridized treatment and control samples for each one of the Cy5 and Cy3 channels with n t = n c = n, where n designate the total number of array replicates.
In our implementation the correct specification of the class labels depends on the type of data analyzed. Thus on the basis of class labels specified, our algorithm identifies the two types of data automatically. Experiments using two channel arrays with direct design correspond to the one-class label case while experiments with two groups of data correspond to the two class label case. Depending on the data type given, a t-statistic is calculated using either a two sample Welch t-statistic for the two class label case or a paired t-statistic for a single class label case. In both cases the t-statistics is calculated using the mt.teststat.num.denum() function from the R package multtest .
where y j is the observed effect size in study j, θ j is the mean gene expression in study j, μ is the average measure of differential expression for each gene across datasets, τ 2 is the estimated between study variability and s j is the estimated within-study variance.
Let denotes the observed unbiased effect size in study j for the two group data case and denotes the observed standardized index in study j for the one group data case. In our implementation y j from eq.9, designate either (see eq.4) for the two group data case or (see eq.6) for the one group data case. In the same way s j is calculated using either the expression given in eq.5 for the two group data case or the expression given in eq.8 for the one group data case. For the rest of this section, depending on data type to be integrated, y j and s j will designate either the observed effect size given eq.4 and its variance given in eq.5 or the observed standardized score given in eq.6 and its variance given in eq.8.
Under the null hypothesis that τ 2 = 0, the statistics Q follows a distribution where l designates the total number of experiments.
In order to estimate the statistical significance of integrated results a permutation-based method developed by Tusher et al.  was used. In our model the permutation method used for estimating the false discovery rate (FDR) depends on the type of class labels provided. For the single class labels the permutation method is based on the paired t-statistic while for the two class label case the Welch t-statistic is used.
IB is supported by the National Canadian Research Training Program in Hepatitis C (NCRTP-HEPC). IDM is supported by Canadian Institutes of Health Research (CIHR). ZZ is supported by Genome Canada through the Ontario Genomics Institute. We thank Kathie Walters in the laboratory of Michael Katze for helpful discussion and Maggie Shuhart who provided the original samples.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.