MAID : An effect size based model for microarray data integration across laboratories and platforms
- Ivan Borozan^{1},
- Limin Chen^{1, 2},
- Bryan Paeper^{7},
- Jenny E Heathcote^{6},
- Aled M Edwards^{1, 2, 3},
- Michael Katze^{7},
- Zhaolei Zhang^{1, 2} and
- Ian D McGilvray^{4, 5}
DOI: 10.1186/1471-2105-9-305
© Borozan et al; licensee BioMed Central Ltd. 2008
Received: 27 February 2008
Accepted: 10 July 2008
Published: 10 July 2008
Abstract
Background
Gene expression profiling has the potential to unravel molecular mechanisms behind gene regulation and identify gene targets for therapeutic interventions. As microarray technology matures, the number of microarray studies has increased, resulting in many different datasets available for any given disease. The sensitivity and reliability of measurements of gene expression changes can be improved through a systematic integration of different microarray datasets that address the same or similar biological questions.
Results
Traditional effect size models cannot be used to integrate array data that directly compare treatment to control samples and are expressed as log ratios of gene expression. Here we extend the traditional effect size model to integrate as many array datasets as possible. The extended effect size model (MAID) can integrate any array data type generated with either single- or two-channel arrays using either direct or indirect designs across different laboratories and platforms. The model uses two standardized indices: the standard effect size score for experiments with two groups of data, and a new standardized index that measures the difference in gene expression between treatment and control groups for one-sample data with replicate arrays. The statistical significance of the treatment effect across studies for each gene is determined by appropriate permutation methods depending on the type of data integrated. We apply our method to three different expression datasets from two different laboratories generated using three different array platforms and two different experimental designs. Our results indicate that the proposed integration model produces an increase in statistical power for identifying differentially expressed genes when integrating data across experiments and when compared to other integration models. We also show that genes found to be significant using our data integration method are of direct biological relevance to the three experiments integrated.
Conclusion
High-throughput genomics data provide a rich and complex source of information that could play a key role in deciphering intricate molecular networks behind disease. Here we propose an extension of the traditional effect size model to allow the integration of as many array experiments as possible with the aim of increasing the statistical power for identifying differentially expressed genes.
Background
Microarray technology is becoming an important tool for biological research and clinical diagnostics [1], but it has the reputation of being noisy: studies addressing the reproducibility and reliability of microarray data across different laboratories and platforms have reached inconsistent conclusions. Some have found agreement between experiments [2–7] while others have not [8–11]. A study by Irizarry et al. [12] on microarray data reproducibility has demonstrated that the disagreement observed in some of the studies may also be due to questionable statistical analysis. There is general agreement that the variability inherent to DNA microarray technology is due to the following factors: a number of microarray platforms have been independently developed by industry and academia; different protocols are used by different laboratories for RNA preparation and labeling; and different statistical and computational tools are used in the analysis of the microarray results. Due to these differences it is challenging to extract reproducible, biologically meaningful information from different DNA microarray experiments that address the same, or very similar, biological questions. One possible solution is to use meta-analysis methods that integrate the results of separate studies in a statistically meaningful manner. There are two main types of meta-analysis that are commonly used for microarray data integration. The first consists of integrating summary measures of gene expression measurements across studies. The advantage of this type of approach is that it avoids the need for estimating the inter-study variability and the issue of cross-platform normalization. Rhodes et al. [13] were the first to implement this type of approach. This group implemented a statistical model based on integrating p-values from individual studies to estimate the overall p-value for each gene across studies.
The authors integrated four published prostate cancer gene array studies. Many of the genes identified were confirmed to be components of biologically relevant pathways, implying that the method extracted biologically useful information. Subsequently Parmigiani et al. [14] proposed a different model that uses a correlation-based method to search for consistent gene expression patterns across multiple studies. They demonstrated that their method can improve correlation of gene expressions across studies. Rather than combining p-values or correlations the second type of meta-analysis consists of integrating gene expression measures across studies. Choi et al. [15] were the first to propose this type of approach using an effect size measure [16] with a method that explicitly models the inter-study variability. Using the same datasets as those used in [13], they demonstrated that their method led to increased sensitivity and reliability. Subsequently Hu et al. [17] extended this model by incorporating a quality measure for each gene in each study into the effect size estimates. Using their model the authors combined two lung cancer Affymetrix datasets generated from two different laboratories and found that their method identifies more differentially expressed genes than previous methods. Taken together these studies suggest that a subset of biologically plausible and statistically significant genes can be determined from the integration of different array technologies. With an ever-increasing amount of microarray data being produced it is critical to develop statistically sound methods that will efficiently integrate, evaluate and cross-validate as many array experiments addressing the same biological question as possible. Even though progress has been made in integrating various array datasets, challenges still remain, one of which is that all the existing methods require experiments with two separate groups of data.
Two-channel microarray technology continues to be used as one of the most common platforms for gene expression profiling [18, 19]. One experimental approach using two-channel arrays is to directly compare levels of mRNA expression between treatment and control samples (also known as direct experimental design). Such experiments lead to datasets with only one group of gene expression ratios. The method proposed in [15] cannot be applied to such datasets since it requires two groups of data. In order to allow the integration of as many datasets as possible, including experiments with one group of data, we extend the model proposed in [15] and propose a new mathematical framework for integrating microarray experiments with one group, two groups of data or mixed groups.
The model proposed in our study is more general than the model proposed in [15], and allows the integration of microarray data of any type generated across different laboratories, platforms and experimental designs. As such, it provides more flexibility for microarray data integration than the previously published effect size based model. The model also provides a new mathematical framework for addressing the inter-study variation for microarray data of different types.
Results
In order to assess the usefulness of our model to integrate real data we applied our method to three different expression datasets generated from two different laboratories using three different 2-channel array platforms and two different experimental designs. All three datasets compared normal liver tissue to liver tissue chronically infected with hepatitis C virus (HCV).
Exploratory data analysis
Genes determined to be significant in the meta-analysis model
In order to further assess the advantage of data integration, we examined whether genes found in our analysis had direct biological relevance. Genes that are determined to have statistically significant expression level changes may still have low fold increases (or decreases) that might not be biologically relevant. Although there is no consensus on the fold increase/decrease associated with 'biological relevance', we chose a fold change of at least 1.5 (increased or decreased) between HCV and normal in at least one of the three integrated studies. This choice is based on the estimate of the median standard deviation (median sd = +/- 0.23) of fold gene expression measurements in the three experiments: a 1.5 fold cutoff on gene expression levels is 2 standard deviations away from the mean of genes with no change in expression (i.e. fold = 1), and thus is less likely to be confounded with non-expressed genes. We found a total of 206 genes to satisfy those criteria. Of the 206 genes, 79 genes were integration-driven discovery (IDD) genes as defined in [15]. We have used a 1.5 fold cutoff in our previous array studies using clinical samples and have determined a number of biologically relevant effects (Chen et al. [31], Borozan et al. [30], Chen et al. [37]), and were able to validate 85% of genes expressed at the 1.5 fold level using quantitative real-time PCR (for more detail about gene validation we refer the reader to [30, 31]).
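The fold-change filter described above can be sketched in a few lines. This is only an illustration: the gene names and log2 ratios are invented, and it assumes each gene has one mean log2(HCV/normal) ratio per study. Note that two standard deviations above a fold of 1 is 1 + 2 × 0.23 = 1.46, which is the value rounded to the 1.5 cutoff.

```python
import math

FOLD_CUTOFF = 1.5
LOG2_CUTOFF = math.log2(FOLD_CUTOFF)  # about 0.585 on the log2 scale


def passes_fold_filter(log2_ratios, cutoff=LOG2_CUTOFF):
    """True if the gene changes by at least 1.5-fold (up or down) in any study."""
    return any(abs(r) >= cutoff for r in log2_ratios)


# Hypothetical mean log2(HCV/normal) ratios in the three studies.
genes = {
    "geneA": [0.70, 0.20, 0.10],     # above the cutoff in the first study
    "geneB": [-0.30, -0.65, -0.40],  # below -cutoff in the second study
    "geneC": [0.25, 0.30, -0.10],    # never reaches the cutoff
}
selected = [name for name, ratios in genes.items() if passes_fold_filter(ratios)]
```

Working on the log2 scale makes the filter symmetric for increases and decreases, since a 1.5-fold decrease corresponds to a log2 ratio of about -0.585.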
Biological pathway analysis
Significantly over-represented GO biological processes
GO over-represented categories | GOBPID | P value | OddsRatio | ExpCount | Count | Size |
---|---|---|---|---|---|---|
immune response | GO:0006955 | 8.39E-008 | 3.56 | 11.25 | 31 | 196 |
organismal physiological process | GO:0050874 | 1.35E-007 | 2.69 | 24.44 | 50 | 426 |
response to biotic stimulus | GO:0009607 | 4.71E-007 | 3.18 | 12.74 | 32 | 222 |
defense response | GO:0006952 | 4.76E-007 | 3.24 | 12.11 | 31 | 211 |
response to stimulus | GO:0050896 | 2.75E-005 | 2.15 | 28.23 | 49 | 492 |
regulation of caspase activity | GO:0043281 | 3.20E-004 | 22.41 | 0.4 | 4 | 7 |
response to pest, pathogen or parasite | GO:0009613 | 1.01E-003 | 2.72 | 6.83 | 16 | 119 |
response to other organism | GO:0051707 | 1.32E-003 | 2.64 | 7 | 16 | 122 |
physiological process | GO:0007582 | 4.77E-003 | 2.2 | 146.25 | 157 | 2549 |
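The ExpCount and P value columns are consistent with a hypergeometric over-representation test, which can be sketched as follows. The universe size (3590) is a hypothetical value chosen so that 206 selected genes reproduce the tabulated ExpCount/Size ratio of about 0.057; it is not stated in the text, and the exact test variant used by the authors is not known, so the p-value computed here need not match the table.

```python
from math import comb


def expected_count(category_size, n_selected, universe_size):
    """Expected number of selected genes in a category under random selection."""
    return n_selected * category_size / universe_size


def hypergeom_upper_tail(k, category_size, n_selected, universe_size):
    """P(X >= k) when n_selected genes are drawn without replacement."""
    total = comb(universe_size, n_selected)
    upper = min(category_size, n_selected)
    hits = sum(
        comb(category_size, i) * comb(universe_size - category_size, n_selected - i)
        for i in range(k, upper + 1)
    )
    return hits / total


# "immune response" row: Size = 196, Count = 31 (from the table).
UNIVERSE = 3590    # hypothetical, see the note above
N_SELECTED = 206   # assumed to be the 206 genes passing the fold filter
exp_immune = expected_count(196, N_SELECTED, UNIVERSE)
p_immune = hypergeom_upper_tail(31, 196, N_SELECTED, UNIVERSE)
```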
Significantly over-represented KEGG pathways
KEGG over-represented pathways | P value |
---|---|
Antigen processing and presentation | 8.16E-005 |
Type I diabetes mellitus | 1.26E-003 |
Ribosome | 3.76E-003 |
Toll-like receptor signaling pathway | 4.40E-002 |
Linoleic acid metabolism | 4.65E-002 |
Comparison of MAID with other integration methods
In this section we compare results from data integrated with MAID to results integrated with the other methods mentioned in the Background section. Among four proposed methods for microarray data integration [13–15, 17], only two methods based on combining summary measures (Rhodes et al. [13] and Parmigiani et al. [14]) can be applied to datasets with both one and two groups of data (a single group of data with two-color array technology can be produced using a direct design approach where disease and control samples are co-hybridized on the same array). The second index introduced in our model allows the general framework proposed in [15] to be extended and applied to datatypes that previously could not be integrated with this method.
The number of significant genes selected by each of the three integration models with E(FP) = 10
Criterion | # of genes (Rhodes et al.) | # of genes (Parmigiani et al.) | # of genes (MAID) |
---|---|---|---|
E(FP) = 10 | 37 | 50 | 350 |
E(FP) = 10 and |fold| ≥ 1.5 | 37 | 14 | 159 |
Top five over-represented GO biological process categories obtained with three different integration models
GO (BP) categories | P values | # of genes |
---|---|---|
MAID | ||
organismal physiological process | 8.90E-007 | 41 |
immune response | 4.96E-006 | 24 |
response to biotic stimulus | 1.41E-005 | 25 |
defense response | 1.81E-005 | 24 |
response to stimulus | 1.98E-004 | 39 |
Rhodes et al. | ||
response to biotic stimulus | 8.12E-006 | 11 |
immune response | 1.84E-005 | 10 |
response to pest, pathogen or parasite | 2.12E-005 | 8 |
response to other organism | 2.54E-005 | 8 |
defense response | 3.54E-005 | 10 |
Parmigiani et al. | ||
glyoxylate metabolism | 4.44E-003 | 1 |
sphingolipid catabolism | 4.44E-003 | 1 |
sphingomyelin metabolism | 4.44E-003 | 1 |
sphingomyelin catabolism | 4.44E-003 | 1 |
coagulation | 8.37E-003 | 2 |
In Figure 6 we show the plot of the number of genes selected by each individual model versus the expected number of false positives. We found that the MAID model selects more genes than the other two models for the same expected number of false positives (i.e. E(FP) = 10, see also Figure 6b). For the purpose of clarity Figure 6b shows the same plot as Figure 6a but limited to gene lists with the expected number of false positives E(FP) ≤ 21. The number of significant genes selected by each model is given in Table 3.
In order to assess whether the larger gene set selected by the MAID model (for the same expected number of false positives E(FP) = 10) enriches relevant biological categories we compared the enriched GO biological process (BP) categories obtained from gene lists selected by each individual model. We also imposed a threshold on selected genes' fold changes by requiring genes to be up (or down) regulated by |fold| ≥ 1.5 (for the reasons noted earlier) in HCV samples when compared to normal samples (see Table 3). In Table 4 we give the top 5 enriched GO categories from each model.
As shown in Table 4 enriched GO (BP) categories obtained with a correlation-based method proposed by Parmigiani et al. [14] are less significant than categories obtained from either MAID or the model proposed by Rhodes et al. [13], and contain many fewer genes (no more than two per category) that show no clear relevance to the HCV disease.
Of the top five significantly enriched GO (BP) categories obtained with gene sets selected by the model proposed by Rhodes et al. [13] and MAID, two can clearly be associated with HCV disease; these are "immune response" and "defense response" (see Table 4). Table 4 shows that the enrichment in genes selected by the MAID model is higher for both of these categories: "immune response" p-value = 4.96e-6 (MAID) vs p-value = 1.84e-5 (model from Rhodes et al. [13]) and "defense response" p-value = 1.81e-5 (MAID) vs p-value = 3.54e-5 (model from Rhodes et al. [13]). These results indicate that when gene sets selected by the model from Rhodes et al. [13] are compared with those selected by MAID, the larger MAID gene set improves the enrichment significance of two of the most significant and HCV-relevant GO categories, and points to an increase in statistical power when compared to the model proposed by Rhodes et al. [13].
Discussion
In this study we introduce a new effect size based model for microarray data integration. We demonstrate that our model, together with appropriate data pre-processing methods, can be used to integrate expression data across different laboratories, array platforms and experimental designs, and that this integration results in an increase in statistical power for identifying differentially expressed genes. Moreover, we show that genes selected as significant by our model enrich relevant biological pathways and processes.
In order to obtain the best possible results with our model, a number of important problems relating to each individual data set had to be addressed. First, it is only reasonable to integrate experiments that aim to address the same or similar biological questions. In order to address the problem of matching of samples and experiments, we integrated only experiments that compared samples of same biological type. Second, because most of the disagreement between individual array experiments was found to be due to platform-dependent probe effects [12], we decided to use only relative gene expression ratios instead of absolute measurements. Third, in order to ensure better agreement between gene annotations across platforms, we focused only on genes that had identical annotation entries in the NCBI Entrez Gene database.
After addressing the problem of matching of probes, samples and experimental conditions we used exploratory analysis methods proposed in [20] to determine if data from the three experiments presented any important systematic bias that would preclude their integration. We found all three datasets to show low correlation coefficients between their effect sizes – though a slightly higher correlation coefficient was found for datasets from the Washington group (see Figure 1). However, inspection of individual effect size distributions showed no fundamental differences between the three datasets (see Figure 2). Low correlations of effect sizes could result from a small group of genes showing similar effects across the three experiments. When expression measurements were integrated using the above methodology, we found 451 genes to be significantly expressed across all three studies with a false discovery rate (FDR) ≤ 0.05. Of these 237 had higher statistical significance in the integrated study than in any individual study. Of 79 integration driven discovery genes found with absolute fold expression greater than 1.5, 57 were shown to be up-regulated (or down-regulated) by at least 1.5 fold in only one of the three studies. This result suggests that the magnitude of fold increase (or decrease) in each individual experiment is a poor indicator of the overall gene activity when comparing across experiments and that a more suitable metric such as effect size needs to be used. Furthermore, of the 206 genes that were found to be significant (with fold ≥ 1.5) in our analysis, 11 were found not to be significant in any of the individual studies. The potential involvement in HCV disease of these genes identified through meta-analysis alone will require further biological study. 
Of four previously published methods proposed for microarray data integration [13–15, 17], two methods [13, 14], based on combining summary measures, can be applied to datasets generated with mixed groups (i.e. with two groups and a single group of data). Comparing results obtained with MAID to results obtained with the models proposed by Rhodes et al. [13] and Parmigiani et al. [14], we found that MAID selects more genes than either of the summary statistics based methods, and that additional genes selected by MAID are relevant to the HCV disease. Genes selected by MAID produce an increase in enrichment of relevant HCV GO categories when compared to results obtained with the two summary statistics methods (see Table 4). These findings argue that MAID produces less conservative results that are also biologically more relevant, indicating an increase in statistical power.
The overlap in results of the top genes selected by each method (for exactly the same number of expected false positives) indicates that models based on integrating p-values [13] and effect sizes (i.e. MAID) across experiments give more similar results than the model based on integrating gene correlations [14].
Models based on summary statistics that integrate p-values [13] or expression correlations [14] across studies can be used to obtain more precise estimates of significance of gene expressions than those obtained from the individual array studies (see for example [13]). However, such approaches do not take into account the inter-study variability and can produce significant results even for genes that have significant fold changes but are observed to be expressed in opposite directions (increased versus decreased) across studies. Models that do take the inter-study variability into account, such as Choi et al. [15] and MAID, would not consider such changes as significant (for example, data integration using the model proposed by Rhodes et al. [13] leads to 19 genes that are significant but for which the fold increase/decrease is directed in opposite directions by at least 1.5 fold in at least one of the studies). In addition to ignoring the direction of change in gene expressions across studies, summary statistics based models do not take the magnitude of observed effects (i.e. fold changes) into account either. In this way significant statistical changes (or small p-values) might not necessarily correspond to an important biological effect (i.e. fold changes) and could inflate the number of false positives. Effect size based models, in contrast, integrate data directly by taking into account the magnitude of the effect and its consistency both within and across studies. Moreover, it has been shown that models based on integrating summary statistics are less sensitive to small but consistent expression changes than an effect size based model (see Choi et al. [15]).
Though we agree in principle with the approach proposed in Choi et al. [15], we note that the model assumes that a fixed or random effect model should be fitted for all the genes. However, this approach might not always be appropriate. As pointed out in [20], it is more likely that for some genes there would be no effect observed, while for others a fixed or random effect model would be more appropriate. A more flexible approach should improve the sensitivity and reliability of this model. Furthermore, as noted in [15], for microarray data and biological systems in general, genes can not always be assumed to act independently, but often show dependency through interactions and correlations. Without a better understanding of gene-gene interaction structures, it is difficult to realize how such improvements could be included in the model. We also note that particular care needs to be taken when integrating many small-sized microarray studies with this model as the estimated between study variability τ^{2} will be biased and would influence overall results [20, 28].
The approach proposed in our study differs from that of [15] and the GeneMeta algorithm [21] in several important aspects. The set of methods proposed in [15], as implemented in the GeneMeta [21] algorithm, can only be applied to experiments with two separate groups of data and thus cannot be applied to two-channel microarray experiments measuring differences in gene expression values between treatment and control groups using a direct experimental design. In order to integrate as many microarray datasets from the public domain as possible we propose a new integration method, which we implemented in the form of the R package MAID. We have made every effort to provide an R package with easy-to-understand, high-quality documentation for non-expert R users; the package is available upon request from the corresponding authors and will be submitted to the Bioconductor [21] project to ease access and dissemination.
In MAID the type of analysis applied depends on the type of data analyzed. Thus for microarray experiments with two groups of data we use the standard effect size model proposed in [15]. For microarray experiments with one group of data we propose a second standardized index based on the paired t-statistic (see eq.6 in Methods section) which follows a Student's t-distribution times $\sqrt{\frac{1}{n}}$, with (n - 1) degrees of freedom (where n is the number of microarray replicates).
In addition to eq.6 (see Methods section) we also propose new estimators for both the pooled standard deviation (which is now given in eq.7 and which replaces the pooled standard deviation given in eqs.2–3 in the Choi et al. model) and the estimated variance (which is now given in eq.8 and which replaces the estimated variance for the unbiased effect size given in eq.5 in the Choi et al. [15] model).
Although we adopt the same general hierarchical model framework as described in Choi et al. [15], a major difference is that for direct design experiments the inter-study variability given in eq.12 (first proposed by DerSimonian and Laird [29]) is calculated using new expressions for the pooled standard deviation and the estimated variance given in eq.7 and eq.8, instead of the expressions given by Choi et al. in eq.3 and eq.5 (see Methods section).
The same changes occur in eqs.9–15, with new estimators replacing those described in Choi et al. [15]. Depending on the type of datasets integrated, the homogeneity test is calculated using either one or both types of standardized indices and their respective variances. MAID implements a permutation method that is specific to each data type: experiments with two groups of data are treated as a two class label case, while experiments with one group of data are treated as a one class label case. In addition to the permutation method for a two class label case, MAID implements a second permutation method (a feature which did not exist in the model proposed by Choi et al.) for a single class label case, which is necessary in the calculation of the false discovery rate (FDR) (see eq.16 in Methods section). Without the proposed new estimators given in eqs.6–8 (see Methods section) and their implementation through eqs.10–16 (see Methods section) it would not have been possible to integrate array experiments with both direct and indirect designs using a more sophisticated model, such as the one proposed in this study, that takes both the intra- and inter-study variability into account.
Conclusion
Traditional effect size models [15] are limited to integration of array datasets with two groups of data. Here we extend the traditional effect size model in order to increase the sample size by allowing the integration of array experiments of any type. Using our model we have shown that it is possible to detect small but consistent changes in gene expression across these three biologically similar but independent studies. Genes with weak signals in each individual experiment can be seen as potential false negatives. We have shown that the number of false negatives can be decreased effectively by using our model. We have also demonstrated that a sizable number of genes could be cross-validated through inter-study comparison indicating that these studies show a certain degree of reproducibility. Our results also indicate that technical and biological variability present in datasets obtained from different laboratories, different platforms and designs can be overcome by appropriate data pre-processing and meta-analysis methods. By comparing our model to other integration methods available, we show that our model selects more genes (for the same number of expected false positives) that are of direct biological relevance to the experiments under consideration.
Finally we have shown that most of the genes found to improve in significance after data integration with our model are of direct biological relevance to the three experiments. High-throughput proteomics and genomics data provide a rich and complex source of information which may help to decipher the complex molecular networks behind disease. Beyond the analysis of the gene expression data presented in this study, our model provides a way of integrating multiple microarray datasets across a broad range of cross-platform studies, and allows a more general and flexible framework for microarray data integration.
Methods
Data sources and preprocessing
Three microarray expression datasets from two laboratories were collected. Two datasets were obtained from the University of Washington. These datasets were collected using two different versions of Agilent array technology. One dataset was generated using the two-channel Agilent Human 1 cDNA array platform containing 12,814 probes. This study used a direct design and included 13 chronic HCV samples co-hybridized with 13 normal samples. The second dataset was generated using the two-channel Agilent Human 1A (V2) 60-mer oligonucleotide array platform containing 20,173 probes. This study used a direct design and included 5 chronic HCV samples co-hybridized with 5 normal samples. The third dataset was obtained from the University of Toronto UHN Microarray Liver Disease Project [30, 31]. This dataset was generated using two-color UHN cDNA microarray slides containing 19,000 probes. This study used an indirect design and included 40 chronic HCV samples co-hybridized to reference RNA and 20 normal samples co-hybridized to the same reference RNA. In total 78 samples were collected across the three studies. All arrays from the University of Washington group were normalized using the Rosetta Resolver error model [32], while all arrays from the University of Toronto were normalized using an R-based, intensity-dependent LOWESS scatter plot smoother (see the Methods section of [31]). Prior to data integration all expression data were log2 transformed.
Annotation
Data integration with models proposed by Parmigiani et al. and Rhodes et al
We used the R package MergeMaid [21] to integrate the three datasets using the integration model proposed by Parmigiani et al. [14].
For the model proposed by Rhodes et al. [13], the study-specific p-values of each gene are combined into a summary statistic S, where p_{j} is the gene-specific p-value for the j^{th} study. The summary statistic S is then compared to an empirical distribution, obtained by computing the summary statistic S from 100000 random permutations of p-values from each study. The meta-analysis p-values are computed as the proportion of random S statistics larger than the actual S statistic. To estimate the false discovery rate we used the R package qvalue [21, 33] with the λ parameter set to zero, which produces an estimate of FDR according to the methodology proposed by Benjamini and Hochberg [34].
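The p-value integration procedure can be sketched as below. The formula for S is not reproduced in this text, so the sketch assumes Fisher's form, S = -2 Σ_j log(p_j). It also replaces the full permutation scheme with independent random draws of one p-value per study, and implements the Benjamini and Hochberg adjustment directly instead of calling qvalue. All function names are illustrative.

```python
import bisect
import math
import random


def fisher_statistic(pvalues):
    """Assumed form of S: -2 * sum(log p_j); p-values must lie in (0, 1]."""
    return -2.0 * sum(math.log(p) for p in pvalues)


def meta_pvalues(pvals_by_study, n_draws=10000, seed=0):
    """pvals_by_study: one list of per-gene p-values per study, same gene order."""
    rng = random.Random(seed)
    n_genes = len(pvals_by_study[0])
    observed = [fisher_statistic([study[g] for study in pvals_by_study])
                for g in range(n_genes)]
    # Null distribution: pair p-values from different studies at random.
    null = sorted(fisher_statistic([rng.choice(study) for study in pvals_by_study])
                  for _ in range(n_draws))
    # Proportion of random S values strictly larger than the observed S.
    return [(n_draws - bisect.bisect_right(null, s)) / n_draws for s in observed]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Because the comparison is strict ("larger than"), a gene whose observed S is the largest attainable value receives an empirical p-value of zero; a finite number of draws cannot resolve it further.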
Microarray Data Integration Model (MAID)
Effect size
Here n_{t} and n_{c} are the individual sample sizes for the treatment and control groups.
S_{1} and S_{2} are the standard deviations of the treatment and control groups.
The total sample size is n = n_{t} + n_{c}.
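The displayed equations for the two-group case (eqs.1–5) are not reproduced in this text. As a hedged sketch, the function below uses the standard forms from the effect size literature that Choi et al. [15] build on: d = (mean_t - mean_c)/S_p with a pooled standard deviation S_p, the small-sample correction d' = d - 3d/(4(n - 2) - 1), and the variance (1/n_t + 1/n_c) + d'^2/(2n). Whether these match the source equations term by term is an assumption.

```python
import math
from statistics import mean, variance


def two_group_effect_size(treatment, control):
    """Return (unbiased effect size d', estimated variance of d')."""
    n_t, n_c = len(treatment), len(control)
    n = n_t + n_c
    # Pooled standard deviation from the two sample variances.
    s_pooled = math.sqrt(
        ((n_t - 1) * variance(treatment) + (n_c - 1) * variance(control)) / (n - 2)
    )
    d = (mean(treatment) - mean(control)) / s_pooled
    # Small-sample bias correction.
    d_unbiased = d - 3.0 * d / (4.0 * (n - 2) - 1.0)
    var_d = 1.0 / n_t + 1.0 / n_c + d_unbiased ** 2 / (2.0 * n)
    return d_unbiased, var_d


# Hypothetical log2 expression values for one gene.
d_prime, var_d_prime = two_group_effect_size([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
```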
Here $\overline{X}$ is the mean difference between treatment and control for one-sample data, ${\sigma}_{paired}^{2}$ is the sample variance of those differences, and t_{paired} is the paired t-statistic, which follows a Student's t-distribution with (n - 1) degrees of freedom.
Here ν designates the number of degrees of freedom (ν = n - 1). The null hypothesis H_{0} tested by t_{paired} is thus that of no difference between treatment and control for one-sample data (i.e. H_{0}: μ = 0). We note that for studies with direct design n_{t} and n_{c} denote the number of co-hybridized treatment and control samples for each one of the Cy5 and Cy3 channels, with n_{t} = n_{c} = n, where n designates the total number of array replicates.
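Eqs.6–8 are not reproduced in this text, but the Discussion states that the one-group index follows a Student's t-distribution times $\sqrt{\frac{1}{n}}$ with (n - 1) degrees of freedom. The sketch below takes that at face value: the index is the paired t-statistic divided by the square root of n, which equals the mean log ratio divided by its standard deviation. The returned variance is the one implied by that null distribution, ν/(n(ν - 2)); it is an assumption and may differ from the paper's eq.8.

```python
import math
from statistics import mean, stdev


def one_group_index(log_ratios):
    """Return (index, t_paired, null variance) for replicate log ratios."""
    n = len(log_ratios)
    x_bar = mean(log_ratios)
    sd = stdev(log_ratios)
    t_paired = x_bar / (sd / math.sqrt(n))
    index = t_paired * math.sqrt(1.0 / n)  # algebraically equal to x_bar / sd
    nu = n - 1
    # Variance of sqrt(1/n) * t_nu; only defined for nu > 2 (assumed form).
    null_variance = nu / (n * (nu - 2)) if nu > 2 else float("nan")
    return index, t_paired, null_variance


# Hypothetical log2(HCV/normal) ratios from four replicate arrays.
index, t_paired, null_var = one_group_index([1.0, 2.0, 3.0, 2.0])
```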
In our implementation the correct specification of the class labels depends on the type of data analyzed. On the basis of the class labels specified, our algorithm identifies the two types of data automatically. Experiments using two-channel arrays with direct design correspond to the one class label case, while experiments with two groups of data correspond to the two class label case. Depending on the data type given, a t-statistic is calculated using either a two-sample Welch t-statistic for the two class label case or a paired t-statistic for the single class label case. In both cases the t-statistic is calculated using the mt.teststat.num.denum() function from the R package multtest [21].
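The label-driven choice of statistic can be illustrated with a small stand-in written in plain Python. It does not call multtest, and the label convention (a single label for direct designs, 0 for control and 1 for treatment otherwise) is an assumption made for this example.

```python
import math
from statistics import mean, variance


def welch_t(treatment, control):
    """Two-sample t-statistic with unequal variances."""
    se = math.sqrt(variance(treatment) / len(treatment) + variance(control) / len(control))
    return (mean(treatment) - mean(control)) / se


def one_sample_t(log_ratios):
    """Paired (one-sample) t-statistic on treatment/control log ratios."""
    return mean(log_ratios) / math.sqrt(variance(log_ratios) / len(log_ratios))


def t_statistic(values, labels):
    """Pick the statistic from the number of distinct class labels."""
    classes = sorted(set(labels))
    if len(classes) == 1:  # direct design: one group of log ratios
        return one_sample_t(values)
    if len(classes) == 2:  # two groups: lower label is taken as control
        control = [v for v, lab in zip(values, labels) if lab == classes[0]]
        treatment = [v for v, lab in zip(values, labels) if lab == classes[1]]
        return welch_t(treatment, control)
    raise ValueError("expected one or two distinct class labels")


t_one = t_statistic([1.0, 2.0, 3.0, 2.0], [1, 1, 1, 1])
t_two = t_statistic([1.0, 2.0, 3.0, 2.0, 3.0, 4.0], [0, 0, 0, 1, 1, 1])
```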
Hierarchical model
where y_{ j }is the observed effect size in study j, θ_{ j }is the mean gene expression in study j, μ is the average measure of differential expression for each gene across datasets, τ^{2} is the estimated between study variability and s_{ j }is the estimated within-study variance.
Let ${{d}^{\prime}}_{j}$ denote the observed unbiased effect size in study j for the two group data case and ${{d}^{\prime\prime}}_{j}$ denote the observed standardized index in study j for the one group data case. In our implementation y_{ j }from eq.9 designates either ${{d}^{\prime}}_{j}$ (see eq.4) for the two group data case or ${{d}^{\prime\prime}}_{j}$ (see eq.6) for the one group data case. In the same way, s_{ j }is calculated using either the expression given in eq.5 for the two group data case or the expression given in eq.8 for the one group data case. For the rest of this section, depending on the data type to be integrated, y_{ j }and s_{ j }will designate either the observed effect size given in eq.4 and its variance given in eq.5, or the observed standardized score given in eq.6 and its variance given in eq.8.
Under the null hypothesis that τ^{2} = 0, the statistic Q follows a ${\chi}_{(l-1)}^{2}$ distribution, where l designates the total number of experiments.
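A minimal sketch of how Q, τ^{2} and the combined estimate of μ could be computed for a single gene is given below. It assumes the method-of-moments estimator of DerSimonian and Laird [29] with inverse-variance weights, and a z score defined as the combined estimate divided by its standard error; these formulas are standard but are not reproduced in this section, so the sketch should be read as an assumption rather than as the exact expressions of the paper. The function name `random_effects` is hypothetical.

```python
import numpy as np

def random_effects(y, s):
    """Combine per-study effect sizes for one gene (assumed DerSimonian-Laird form).

    y : observed effect sizes y_j, one per study.
    s : within-study variances s_j, one per study.
    """
    y = np.asarray(y, dtype=float)
    s = np.asarray(s, dtype=float)
    l = y.size                                   # number of experiments

    w = 1.0 / s                                  # fixed-effects weights
    mu_fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - mu_fixed) ** 2)          # ~ chi-square(l - 1) if tau^2 = 0

    # Method-of-moments estimate of between-study variance, truncated at zero.
    denom = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (l - 1)) / denom)

    w_star = 1.0 / (s + tau2)                    # random-effects weights
    mu = np.sum(w_star * y) / np.sum(w_star)
    var_mu = 1.0 / np.sum(w_star)
    return {"Q": q, "tau2": tau2, "mu": mu, "z": mu / np.sqrt(var_mu)}
```

For two studies with y = (0, 2) and s = (0.5, 0.5), this gives Q = 4, τ^{2} = 1.5, μ = 1 and z = 1; the between-study variance widens the uncertainty of μ relative to the fixed-effects result.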
To estimate the statistical significance of the integrated results, a permutation-based method developed by Tusher et al. [35] was used. In our model the permutation method used for estimating the false discovery rate (FDR) depends on the type of class labels provided. For the one class label case the permutation method is based on the paired t-statistic, while for the two class label case the Welch t-statistic is used.
where z_{ th }is the threshold on the z score statistic [15] (see eq.15), and where I() equals 1 if the condition in parenthesis is true, and 0 otherwise.
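The thresholding with the indicator function I() can be sketched as below. Because the FDR expression itself is not reproduced in this section, the sketch assumes the common permutation estimator in the style of Tusher et al. [35]: the average number of permuted z scores passing the threshold divided by the number of observed z scores passing it. The use of absolute values, the `>=` comparison and the function name `estimate_fdr` are assumptions.

```python
import numpy as np

def estimate_fdr(z_obs, z_perm, z_th):
    """Permutation-based FDR estimate at threshold z_th (assumed estimator).

    z_obs  : observed z scores, shape (genes,).
    z_perm : z scores from B permuted datasets, shape (B, genes).
    z_th   : threshold applied to |z|.
    """
    z_obs = np.asarray(z_obs, dtype=float)
    z_perm = np.asarray(z_perm, dtype=float)

    # Sum of I(|z_i| >= z_th) over genes in the observed data.
    called = np.sum(np.abs(z_obs) >= z_th)
    if called == 0:
        return 0.0

    # Same count in each permuted dataset, averaged over the B permutations.
    false_calls = np.mean(np.sum(np.abs(z_perm) >= z_th, axis=1))
    return min(1.0, false_calls / called)
```

In use, `z_perm` would be obtained by recomputing the z scores after permuting the data in the manner appropriate to the class labels (paired or Welch t-statistic), and `z_th` would be varied to reach a target FDR.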
Declarations
Acknowledgements
IB is supported by the National Canadian Research Training Program in Hepatitis C (NCRTP-HEPC). IDM is supported by Canadian Institutes of Health Research (CIHR). ZZ is supported by Genome Canada through the Ontario Genomics Institute. We thank Kathie Walters in the laboratory of Michael Katze for helpful discussion and Maggie Shuhart who provided the original samples.
References
- Glas AM, Floore A, Delahaye LJMJ, Witteveen AT, Pover RCF, Bakx N, Lahti-Domenici JST, Bruinsma TJ, Warmoes MO, Bernards R, Wessels LFA, Van 't Veer LJ: Converting a breast cancer microarray signature into a high-throughput diagnostic test. BMC Genomics 2006, 7: 278.
- Kane M, Jatkoe T, Stumpf C, Lu J, Thomas J, Madore S: Assessment of the sensitivity and specificity of oligonucleotide (50 mer) microarrays. Nucleic Acids Research 2000, 28(22):4552. 10.1093/nar/28.22.4552
- Hughes T, Mao M, Jones A, Burchard J, Marton M, Shannon K, Lefkowitz S, Ziman M, Schelter J, Meyer M, Kobayashi S, Davis C, Dai H, He Y, Stephaniants S, Cavet G, Walker W, West A, Coffey E, Shoemaker D, Stoughton R, Blanchard A, Friend S, Linsley P: Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nature Biotechnology 2001, 19(4):342–347. 10.1038/86730
- Yuen T, Wurmbach E, Pfeffer RL, Ebersole BJ, Sealfon SC: Accuracy and calibration of commercial oligonucleotide and custom cDNA microarrays. Nucleic Acids Research 2002, 30(10):e48. 10.1093/nar/30.10.e48
- Barczak A, Rodriguez MW, Hanspers K, Koth LL, Tai YC, Bolstad BM, Speed TP, Erle DJ: Spotted long oligonucleotide arrays for human gene expression analysis. Genome Research 2003, 13(7):1775–1785. 10.1101/gr.1048803
- Carter M, Hamatani T, Sharov A, Carmack C, Qian Y, Aiba K, Ko N, Dudekula D, Brzoska P, Hwang S, Ko M: In situ-synthesized novel microarray optimized for mouse stem cell and early developmental expression profiling. Genome Research 2003, 13(3):1011–21. 10.1101/gr.878903
- Wang H, Malek R, Kwitek A, Greene A, Luu T, Behbahani B, Frank B, Quackenbush J, Lee N: Assessing unmodified 70-mer oligonucleotide performance on glass-slide microarrays. Genome Biology 2003, 4(1):R5. 10.1186/gb-2003-4-1-r5
- Kuo W, Jenssen T, Butte A, Ohno-Machado L, Kohane I: Analysis of mRNA measurements from two different microarray technologies. Bioinformatics 2002, 18(3):405–412. 10.1093/bioinformatics/18.3.405
- Kothapalli R, Yoder S, Mane S, L T Jr: Microarray results: how accurate are they? BMC Bioinformatics 2002, 3(1):22. 10.1186/1471-2105-3-22
- Li J, Pankratz M, Johnson J: Differential gene expression patterns revealed by oligonucleotide versus long cDNA arrays. Toxicological Sciences 2003, 69(2):383–390. 10.1093/toxsci/69.2.383
- Tan P, Downey T, Spitznagel EJ, Xu P, Fu D, Dimitrov D, Lempicki R, Raaka B, Cam M: Evaluation of gene expression measurements from commercial platforms. Nucleic Acids Research 2003, 31(19):5676–5684. 10.1093/nar/gkg763
- Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, Frank BC, Gabrielson E, Garcia JG, Geoghegan J, Germino G, Griffin C, Hilmer SC, Hoffman E, Jedlicka AE, Kawasaki E, Martinez-Murillo F, Morsberger L, Lee H, Petersen D, Quackenbush J, Scott A, Wilson M, Yang Y, Ye SQ, Yu W: Multiple-laboratory comparison of microarray platforms. Nat Methods 2005, 2(5):345–50. 10.1038/nmeth756
- Rhodes DR, Barrette TR, Rubin MA, Ghosh D, Chinnaiyan AM: Meta-analysis of microarrays: interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Res 2002, 62(15):4427–33.
- Parmigiani G, Garrett-Mayer ES, Anbazhagan R, Gabrielson E: A cross-study comparison of gene expression studies for the molecular classification of lung cancer. Clin Cancer Res 2004, 10(9):2922–7. 10.1158/1078-0432.CCR-03-0490
- Choi JK, Yu U, Kim S, Yoo OJ: Combining multiple microarray studies and modeling interstudy variation. Bioinformatics 2003, 19(Suppl 1):i84–90. 10.1093/bioinformatics/btg1010
- Hedges LV, Olkin I: Statistical Methods for Meta-analysis. Academic Press, Orlando, FL; 1987.
- Hu P, Greenwood CM, Beyene J: Integrative analysis of multiple gene expression profiles with quality-adjusted effect size models. BMC Bioinformatics 2005, 6: 128. 10.1186/1471-2105-6-128
- Dabney AR, Storey JD: Normalization of two-channel microarrays accounting for experimental design and intensity-dependent relationships. Genome Biol 2007, 8(3):R44. 10.1186/gb-2007-8-3-r44
- MAQC Consortium, Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, de Longueville F, Kawasaki ES, Lee KY, Luo Y, Sun YA, Willey JC, Setterquist RA, Fischer GM, Tong W, Dragan YP, Dix DJ, Frueh FW, Goodsaid FM, Herman D, Jensen RV, Johnson CD, Lobenhofer EK, Puri RK, Scherf U, Thierry-Mieg J, Wang C, Wilson M, Slikker W Jr: The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol 2006, 24(9):1151–61. 10.1038/nbt1239
- Gentleman R, Ruschhaupt M, Huber W: On the Synthesis of Microarray Experiments. Bioconductor Project Working Papers. Working Paper 8. The Berkeley Electronic Press 2005. [http://www.bepress.com/bioconductor/paper8]
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80. 10.1186/gb-2004-5-10-r80
- Gale M Jr, Foy EM: Evasion of intracellular host defence by hepatitis C virus. Nature 2005, 436(7053):939–45. 10.1038/nature04078
- Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 1999, 27(1):29–34. 10.1093/nar/27.1.29
- Bowen DG, Walker CM: Adaptive immune responses in acute and chronic hepatitis C virus infection. Nature 2005, 436(7053):946–52. 10.1038/nature04079
- Pestova TV, Shatsky IN, Fletcher SP, Jackson RJ, Hellen CUT: A prokaryotic-like mode of cytoplasmic eukaryotic ribosome binding to the initiation codon during internal translation initiation of hepatitis C and classical swine fever virus RNAs. Genes Dev 1998, 12: 67–83. 10.1101/gad.12.1.67
- Laletina E, Graifer D, Malygin A, Ivanov A, Shatsky I, Karpova G: Proteins surrounding hairpin IIIe of the hepatitis C virus internal ribosome entry site on the human 40S ribosomal subunit. Nucleic Acids Res 2006, 34(7):2027–36. 10.1093/nar/gkl155
- Otto GA, Lukavsky PJ, Lancaster AM, Sarnow P, Puglisi JD: Ribosomal proteins mediate the hepatitis C virus IRES-HeLa 40S interaction. RNA 2002, 8(7):913–23. 10.1017/S1355838202022057
- Bohning D, Malzahn U, Dietz E, Schlattmann P, Viwatwongkasem C, Biggeri A: Some general points in estimating heterogeneity variance with the DerSimonian-Laird estimator. Biostatistics 2002, 3(4):445–57. 10.1093/biostatistics/3.4.445
- DerSimonian R, Laird NM: Meta-analysis in clinical trials. Controlled Clinical Trials 1986, 7: 177–188. 10.1016/0197-2456(86)90046-2
- Borozan I, Chen L, Sun J, Tannis LL, Guindi M, Rotstein OD, Heathcote J, Edwards AM, Grant D, McGilvray ID: Gene expression profiling of acute liver stress during living donor liver transplantation. Am J Transplant 2006, 6(4):806–24. 10.1111/j.1600-6143.2006.01254.x
- Chen L, Borozan I, Feld J, Sun J, Tannis LL, Coltescu C, Heathcote J, Edwards AM, McGilvray ID: Hepatic gene expression discriminates responders and nonresponders in treatment of chronic hepatitis C viral infection. Gastroenterology 2005, 128(5):1437–44. 10.1053/j.gastro.2005.01.059
- Weng L, Dai H, Zhan Y, He Y, Stepaniants SB, Bassett DE: Rosetta error model for gene expression analysis. Bioinformatics 2006, 22(9):1111–21. 10.1093/bioinformatics/btl045
- Storey JD: A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B 2002, 64: 479–498. 10.1111/1467-9868.00346
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B 1995, 57: 289–300.
- Tusher VG, Tibshirani R, Chu G: Significance Analysis of Microarrays Applied to the Ionizing Radiation Response. Proc Natl Acad Sci USA 2001, 98(9):5116–21. 10.1073/pnas.091062498
- Kim KY, Ki DH, Jeong HJ, Jeung HC, Chung HC, Rha SY: Novel and simple transformation algorithm for combining microarray data sets. BMC Bioinformatics 2007, 8: 218. 10.1186/1471-2105-8-218
- Chen L, Borozan I, Milkiewicz P, Sun J, Meng X, Coltescu C, Edwards AM, Ostrowski MA, Guindi M, Heathcote EJ, McGilvray ID: Gene expression profiling of early primary biliary cirrhosis: possible insights into the mechanism of action of ursodeoxycholic acid. Liver Int 2008 Apr 15.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.