In this study we introduce a new effect size based model for microarray data integration. We demonstrate that our model, together with appropriate data pre-processing methods, can be used to integrate expression data across different laboratories, array platforms and experimental designs, and that this integration increases the statistical power for identifying differentially expressed genes. Moreover, we show that genes selected as significant by our model are enriched in relevant biological pathways and processes.

In order to obtain the best possible results with our model, a number of important problems relating to each individual data set had to be addressed. First, it is only reasonable to integrate experiments that aim to address the same or similar biological questions. To address the problem of matching samples and experiments, we integrated only experiments that compared samples of the same biological type. Second, because most of the disagreement between individual array experiments was found to be due to platform-dependent probe effects [12], we decided to use only relative gene expression ratios instead of absolute measurements. Third, in order to ensure better agreement between gene annotations across platforms, we focused only on genes that had identical annotation entries in the NCBI Entrez Gene database.

After addressing the problem of matching probes, samples and experimental conditions, we used exploratory analysis methods proposed in [20] to determine whether data from the three experiments presented any important systematic bias that would preclude their integration. We found that all three datasets showed low correlation coefficients between their effect sizes, though a slightly higher correlation coefficient was found for datasets from the Washington group (see Figure 1). However, inspection of individual effect size distributions showed no fundamental differences between the three datasets (see Figure 2). Low correlations of effect sizes could result from a small group of genes showing similar effects across the three experiments. When expression measurements were integrated using the above methodology, we found 451 genes to be significantly differentially expressed across all three studies with a false discovery rate (FDR) ≤ 0.05. Of these, 237 had higher statistical significance in the integrated study than in any individual study. Of 79 integration-driven discovery genes found with absolute fold change greater than 1.5, 57 were shown to be up-regulated (or down-regulated) by at least 1.5 fold in only one of the three studies. This result suggests that the magnitude of fold increase (or decrease) in each individual experiment is a poor indicator of the overall gene activity when comparing across experiments and that a more suitable metric such as effect size needs to be used. Furthermore, of the 206 genes that were found to be significant (with fold change ≥ 1.5) in our analysis, 11 were found not to be significant in any of the individual studies. The potential involvement in HCV disease of these genes, which were identified through meta-analysis alone, will require further biological study.

Of four previously published methods proposed for microarray data integration [13–15, 17], two methods [13, 14], based on combining summary measures, can be applied to datasets generated with mixed groups (i.e., with both two groups and a single group of data). Comparing results obtained with MAID to results obtained with the models proposed by Rhodes et al. [13] and Parmigiani et al. [14], we found that MAID selects more genes than either of the summary statistics based methods, and that the additional genes selected by MAID are relevant to HCV disease. Genes selected by MAID produce an increase in enrichment of relevant HCV GO categories when compared to results obtained with the two summary statistics methods (see Table 4). These findings argue that MAID produces less conservative results that are also biologically more relevant, indicating an increase in statistical power.

The overlap in the top genes selected by each method (for exactly the same number of expected false positives) indicates that models based on integrating p-values [13] and effect sizes (i.e., MAID) across experiments give more similar results to each other than to the model based on integrating gene correlations [14].

Models based on summary statistics that integrate p-values [13] or expression correlations [14] across studies can be used to obtain more precise estimates of the significance of gene expression changes than those obtained from the individual array studies (see for example [13]). However, such approaches do not take the inter-study variability into account and can produce significant results even for genes that have significant fold changes but are observed to change in opposite directions (increased versus decreased) across studies. Models that do take the inter-study variability into account, such as Choi et al. [15] and MAID, would not consider such changes as significant (for example, data integration using the model proposed by Rhodes et al. [13] leads to 19 genes that are significant but for which the fold change is in the opposite direction, by at least 1.5 fold, in at least one of the studies). In addition to ignoring the direction of change in gene expression across studies, summary statistics based models do not take the magnitude of observed effects (i.e., fold changes) into account either. As a result, statistically significant changes (small p-values) might not necessarily correspond to important biological effects (i.e., fold changes) and could inflate the number of false positives. Effect size based models, in contrast, integrate data directly by taking into account the magnitude of the effect and its consistency both within and across studies. Moreover, it has been shown that models based on integrating summary statistics are less sensitive to small but consistent expression changes than an effect size based model (see Choi et al. [15]).
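
The contrast between the two families of models can be illustrated with a small sketch. It is not the procedure of [13] or of MAID: Fisher's method stands in here as a generic way of combining two-sided p-values, the effect size side uses a plain inverse-variance (fixed-effect) weighted mean, and the gene, its effect sizes and its variances are hypothetical values chosen for illustration.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2))


def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p) is chi-square with 2k df under the null."""
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # half of the chi-square statistic
    # Closed-form upper tail of a chi-square distribution with even df = 2k.
    return math.exp(-half_x) * sum(half_x ** i / math.factorial(i) for i in range(k))


def inverse_variance_z(effects, variances):
    """z-score of the inverse-variance weighted mean effect (fixed-effect form)."""
    weights = [1.0 / v for v in variances]
    mean_effect = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    return mean_effect * math.sqrt(sum(weights))  # mean divided by its standard error


# Hypothetical gene: strongly up in study 1, equally strongly down in study 2.
effects = [1.5, -1.5]
variances = [0.2, 0.2]

study_p = [two_sided_p(d / math.sqrt(v)) for d, v in zip(effects, variances)]
combined_p = fisher_combined_p(study_p)              # ignores the sign of each effect
combined_z = inverse_variance_z(effects, variances)  # opposite signs cancel
effect_p = two_sided_p(combined_z)
```

In this toy case the combined p-value is very small because each study is individually significant, while the pooled effect size is zero because the two effects cancel. Adding a between-study variance term, as a random effect model does, would only widen the uncertainty around that pooled effect.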

Though we agree in principle with the approach proposed in Choi et al. [15], we note that the model assumes that a fixed or random effect model should be fitted for all the genes. However, this approach might not always be appropriate. As pointed out in [20], it is more likely that for some genes there would be no effect observed, while for others a fixed or random effect model would be more appropriate. A more flexible approach should improve the sensitivity and reliability of this model. Furthermore, as noted in [15], for microarray data and biological systems in general, genes cannot always be assumed to act independently, but often show dependency through interactions and correlations. Without a better understanding of gene-gene interaction structures, it is difficult to see how such improvements could be included in the model. We also note that particular care needs to be taken when integrating many small-sized microarray studies with this model, as the estimated between-study variability *τ*² will be biased and would influence overall results [20, 28].
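
For readers unfamiliar with how the between-study variability is usually estimated, the following is a minimal sketch of the standard DerSimonian and Laird moment estimator, applied to one gene. It is a generic textbook form, not the MAID implementation; the function name and the input values are hypothetical.

```python
def dersimonian_laird_tau2(effects, variances):
    """Return (tau2, Q) for one gene from per-study effects and their variances."""
    k = len(effects)
    weights = [1.0 / v for v in variances]
    total_w = sum(weights)
    fixed_mean = sum(w * d for w, d in zip(weights, effects)) / total_w
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(w * (d - fixed_mean) ** 2 for w, d in zip(weights, effects))
    scale = total_w - sum(w * w for w in weights) / total_w
    tau2 = max(0.0, (q - (k - 1)) / scale)  # moment estimate, truncated at zero
    return tau2, q


# Hypothetical effect sizes from three studies with equal within-study variance.
tau2_hetero, q_hetero = dersimonian_laird_tau2([0.0, 1.0, 2.0], [0.1, 0.1, 0.1])
tau2_homog, q_homog = dersimonian_laird_tau2([0.5, 0.5, 0.5], [0.1, 0.1, 0.1])
```

The estimate depends on the per-study variances through the weights, so when those variances are themselves poorly estimated from few replicates, the resulting estimate of the between-study variability can be expected to inherit that error.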

The approach proposed in our study differs from that of [15] and the GeneMeta algorithm [21] in several important aspects. The set of methods proposed in [15], as implemented in the GeneMeta [21] algorithm, can only be applied to experiments with two separate groups of data and thus cannot be applied to two-channel microarray experiments that measure differences in gene expression between treatment and control groups using a direct experimental design. In order to integrate as many microarray datasets from the public domain as possible, we proposed a new integration method, which we implemented in the form of the R package MAID. We have made every effort to provide an R package with easy-to-understand, high-quality documentation for non-expert R users. The package is available upon request from the corresponding authors and will be submitted to the Bioconductor [21] project to ease access and dissemination.

In MAID the type of analysis applied depends on the type of data analyzed. Thus for microarray experiments with two groups of data we use the standard effect size model proposed in [15]. For microarray experiments with one group of data we propose a second standardized index based on the paired *t*-statistic (see eq.6 in Methods section), which follows, up to a scaling factor, a Student's *t*-distribution with (*n* - 1) degrees of freedom (where *n* is the number of microarray replicates).
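
The relationship between a one-group standardized index and the paired *t*-statistic can be sketched as follows. The exact form of eq.6 is not reproduced in this section, so the sketch assumes the simplest such index, the mean log ratio divided by its sample standard deviation; under that assumption the index equals the paired *t*-statistic divided by √*n*. Any bias correction is omitted, and the function name and input values are hypothetical.

```python
import math
import statistics


def one_group_index(log_ratios):
    """Standardized index for a single group of log expression ratios.

    Assumes index = mean / sample SD (an assumption, not taken from eq.6).
    Returns the index, the matching paired t-statistic and its degrees of freedom.
    """
    n = len(log_ratios)
    mean_ratio = statistics.mean(log_ratios)
    sd_ratio = statistics.stdev(log_ratios)  # sample SD with n - 1 denominator
    index = mean_ratio / sd_ratio
    t_stat = index * math.sqrt(n)  # paired t = mean / (SD / sqrt(n))
    return index, t_stat, n - 1


# Hypothetical log2 ratios for one gene across three replicate arrays.
index, t_stat, dof = one_group_index([1.0, 2.0, 3.0])
```

Because the index does not grow with the number of replicates the way the *t*-statistic does, it can be compared across studies of different sizes, which is the property an effect size based integration relies on.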

In addition to eq.6 (see Methods section) we also propose new estimators for both the pooled standard deviation (which is now given in eq.7 and which replaces the pooled standard deviation given in eqs.2–3 in the Choi et al. model) and the estimated variance (which is now given in eq.8 and which replaces the estimated variance for the unbiased effect size given in eq.5 in the Choi et al. [15] model).

Although we adopt the same general hierarchical model framework as described in Choi et al. [15], a major difference is that for direct design experiments the inter-study variability given in eq.12 (first proposed by DerSimonian and Laird [29]) is calculated using the new expressions for the pooled standard deviation and the estimated variance given in eq.7 and eq.8, instead of the expressions given by Choi et al. in eq.3 and eq.5 (see Methods section).

The same changes occur in eqs.9–15, with new estimators replacing those described in Choi et al. [15]. Depending on the type of datasets integrated, the homogeneity test is calculated using either one or both types of standardized indices and their respective variances. MAID implements a permutation method that is specific to each data type: experiments with two groups of data are treated as a two class label case, while experiments with one group of data are treated as a one class label case. In addition to the permutation method for the two class label case, MAID implements a second permutation method for the single class label case (a feature which did not exist in the model proposed by Choi et al.), which is necessary for the calculation of the false discovery rate (FDR) (see eq.16 in Methods section). Without the proposed new estimators given in eqs.6–8 (see Methods section) and their implementation through eqs.10–16 (see Methods section), it would not have been possible to integrate array experiments with both direct and indirect designs using a more sophisticated model, such as the one proposed in this study, that takes both the intra- and inter-study variability into account.
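
The difference between the two permutation schemes can be sketched as follows. This section does not specify how MAID permutes single-group data, so the sketch assumes random sign flipping of the log ratios, a common choice for one class problems; the two class case uses ordinary label shuffling. Function names and data are hypothetical.

```python
import random


def permute_two_class(labels, rng):
    """Two class case: shuffle group labels across arrays."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


def permute_one_class(log_ratios, rng):
    """One class case (assumed scheme): flip the sign of each log ratio at random."""
    return [x if rng.random() < 0.5 else -x for x in log_ratios]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical two-group experiment: three treatment and three control arrays.
labels = ["treatment"] * 3 + ["control"] * 3
null_labels = permute_two_class(labels, rng)

# Hypothetical direct-design experiment: one log ratio per replicate array.
log_ratios = [0.8, 1.1, 0.9, 1.3]
null_ratios = permute_one_class(log_ratios, rng)
```

In both cases the statistic of interest would be recomputed on each permuted dataset to build a null distribution, from which an FDR estimate can then be derived.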