- Open Access
A statistical framework for integrating two microarray data sets in differential expression analysis
© Lai et al; licensee BioMed Central Ltd. 2009
- Published: 30 January 2009
Different microarray data sets can be collected for studying the same or similar diseases. We expect to achieve a more efficient analysis of differential expression if an efficient statistical method can be developed for integrating different microarray data sets. Although many statistical methods have been proposed for data integration, the genome-wide concordance of different data sets has not been well considered in the analysis.
Before considering data integration, it is necessary to evaluate the genome-wide concordance so that misleading results can be avoided. Based on the test results, different subsequent actions are suggested. The evaluation of genome-wide concordance and the data integration can be achieved based on the normal distribution based mixture models.
The results from our simulation study suggest that misleading results can be generated if the genome-wide concordance issue is not appropriately considered. Our method provides a rigorous parametric solution. The results also show that our method is robust to certain model misspecification and is practically useful for the integrative analysis of differential expression.
- Mixture Model
- Data Integration
- Differential Expression Analysis
- Congenic Strain
- Simulation Configuration
Microarray is an experimental method by which tens of thousands of genes can be printed on a small chip and their expression can be measured simultaneously [1, 2]. Since the microarray technology was introduced, it has been widely used in many biomedical studies [3, 4]. Microarrays can be used to measure expression for tens of thousands of genes at the mRNA level for samples in normal and disease groups, and then statistical methods for two-sample comparison can be used to identify differentially expressed genes. Differentially expressed genes are potential disease related genes for clinical diagnoses and medical treatments. This approach has been successfully used in cancer studies [4, 5] as well as diabetes studies [6, 7].
Although microarray technology has been developed for more than a decade, the experiment cost is still considerably expensive. This limits the sample size of microarray studies. Therefore, the detection power can be low, especially when the signal of differential expression is relatively weak . Many microarray data sets have been collected for the same or similar research purpose. Detecting genes with concordant behavior among different data sets is of biological interest. It is also of statistical interest to improve the detection power if it is feasible to integrate different data sets in differential expression analysis. For this reason, several methods have been proposed for data integration [9–14].
However, the genome-wide concordance of different data sets has not been well considered in these integrative analyses. A gene selected for the follow-up analysis should behave concordantly in different data sets. For example, if a gene is up-regulated in one experiment, then it should also be up-regulated in another experiment. Slight inconsistency should be expected since there are considerable noises generated by microarray experiments. If two data sets are genome-wide concordant, then integrating them can generally improve the sample size and reduce the noise impact. Therefore, it is desirable to combine observations of concordant genes since we expect to achieve a more powerful detection of differential expression. However, if two data sets are not genome-wide concordant, then there are genes with discordant behavior in different data sets. There are many possible factors for such observations, such as population heterogeneity, probe binding issues from different microarray platforms, as well as lab-specific system noises. Therefore, integrating observations of discordant genes may result in misleading conclusions and should be discouraged.
When a seemingly discordant behavior is observed for a gene, it is difficult to tell whether the observation is generated by random noises or the observation reflects the underlying truth. Therefore, it is not trivial to determine whether a gene has a concordant/discordant behavior in different experiments. The analysis will be more complicated for evaluating genome-wide concordance. Cahan et al.  have studied different gene lists identified from different data sets. Ein-Dor et al.  have showed that we may need to collect thousands of samples to generate a robust gene list for disease prediction. Miron et al.  have proposed a correlation based approach for measuring concordance between two lists of test statistics from two data sets. However, this approach does not consider the fact that different genes in a data set belong to different components (non-differentially expressed, up-regulated, down-regulated, etc.). We have recently proposed a mixture-model based method for testing genome-wide concordance and discordance . This approach considers the mixture of different gene components as well as the independence between two data sets, and it can be extended for data integration. In this study, we propose a mixture-model based statistical framework to achieve a rigorous integrative analysis of differential expression.
In a recent study , it has been shown that the widely used overlap count (or Venn diagrams) is not an appropriate metric for measuring the reproducibility of differential expression analysis. It is necessary to develop new metrics for rigorously measuring the reproducibility of differential expression analysis. The disadvantage of overlap count metric is that the randomness of differential expression measures (e.g. t-test) has not been well considered. However, our mixture-model based tests of genome-wide concordance and discordance  take this randomness into account and the reported p-values can be used as rigorous metrics for measuring the reproducibility of differential expression analysis.
For the rest of paper, we first introduce our statistical framework. Then, we use simulated data to evaluate its performance. Two experimental data based case studies are considered as the applications. Finally, we discuss the advantages and disadvantages of our method.
A statistical framework
In each data set, perform a statistical test of differential expression for each gene to obtain a lists of test scores;
For each list of test scores, perform a transformation procedure to obtain a list of z-scores;
For two lists of z-scores, test the complete discordance between them;
if the complete discordance cannot be rejected, then the data integration will be discouraged;
if the complete discordance can be rejected, then continue to the next step;
For two lists of z-scores, test the complete concordance between them;
if the complete concordance cannot be rejected, then calculate a list of concordant integrative scores based on the complete concordance (CC) model;
if the complete concordance can be rejected, then calculate a list of concordant integrative scores based on the partial concordance/discordance (PCD) model;
Use the list of concordant integrative scores to prioritize genes for the follow-up study.
Test of differential expression
For simplicity, we consider the Student's two-sample t-test for differential expression analysis. Other test statistics, such as Wilcoxon's rank sum test or a generalized t/F-statistic , can certainly be considered. The statistical significance (p-value) of a test value can be evaluated based on either its theoretical null distribution or a permutation null distribution . In this study, the theoretical p-value is used for the simulation study since we know the underlying distribution; the permutation p-value is used for the application since the underlying distribution is unknown (B = 500 is used as the number of permutations).
Transformation of test score
It has been suggested transforming a test value to its associated z-score so that more efficient results can be achieved in a normal mixture model based analysis . When the one-sided (upper-tailed) p-value of a test value is available, the associated z-score can be simply calculated by
z = Φ-1(1 - p),
where Φ-1(·) is the inverse of standard normal distribution. Notice that it is necessary to use one-sided p-values since we intend to distinguish up-regulated differential expression from down-regulated differential expression.
More details for these models have been described in our recent publication . In these models, index 0 is used to represent the null component with fixed parameters: μ0 = ν0 = 0 and ; indices 1 and 2 are used to represent the down-regulated and up-regulated components with constrains: μ1, ν1 ≤ 0 and μ2, ν2 ≥ 0; π ij is the proportion of genes belonging to the i-th component in the first data set and j-th component in the second data set (Σ ij π ij = 1). πi.is the marginal proportion of genes belonging to the i-th component in the first data set; and π. j is the marginal proportion of genes belonging to the j-th component in the second data set. The model parameters can be estimated through an E-M algorithm . The detail has also been described in our recent publication .
Tests of concordance and discordance
False positive control
where S(·) is calculated based on an appropriate mixture model (CC or PCD). With this estimate, one may realize that the false discovery rate  (FDR) based on the q-value concept  can be simply estimated as /K.
For an individual data set, we simply use the R-package qvalue  to obtain the estimated FDR. Then, the number of false positives for the top K genes can be simple estimated as K × FDR(K), which is theoretically consistent with the above .
A simulation study
There are many parameters to be considered when we simulate microarray gene expression data:
Gene size m;
The proportion of non-differentially expressed genes π0;
Sample sizes of two groups n1, n2;
Distributions of expression measurements of differentially and non-differentially expressed genes;
Covariance structure among genes.
In our simulation studies, we consider the widely used block structure: genes are partitioned into many blocks; genes within the same block are positively dependent; and different blocks are independent. To save the computing time, we reasonably set gene size m = 6000, π0 = 80% and n1 = n2 = 15 for each of two data sets. The block size (number of genes in each block) is set b = 25. Within each block, the expression measurements are simulated from a multivariate normal distribution. For blocks of non-differentially expressed genes, we simulate expression measurements from N(, Σ0), where is a b × 1 vector of 0's and Σ0 is a b × b matrix with diagonal entries as 1 and non-diagonal entries as a fixed value (simulated from a Uniform distribution U[0.5, 0.9]). For blocks of differentially expressed genes, we simulate expression measurements from N(, Σ1) and N(, Σ2) for the first and the second sample groups, respectively. is simply a b × 1 vector of 0's. To simulate , we first simulate b × 1 vector of random numbers from a Beta distribution Beta(1.5, 1.5), multiply this vector by a factor r = 1.5, and then multiply randomly simulated signs (50% positive and 50% negative) so that both up and down regulated differential expression can be generated. Σ1 and Σ2 are similarly generated as Σ0. Two data sets are first simulated based on the same configuration. To simulate genes with discordant behavior, we randomly reallocate ξ = 0%, 15%, 30%, 45%, 60% and 75% genes in the second data set so that these genes are no longer matched with those in the first data set. Notice that the simulation configuration is not completely consistent with what have been assumed for our method. This is intentionally designed to understand the robustness of our method. The complete concordance or the complete discordant will be rejected at the level p-value < 0.025 since there is a issue of multiple hypothesis testing. To save computing time, we only perform the parametric bootstrap for 100 times to evaluate the p-value of a test.
For each round of simulation, since we know the truth (simulation configuration), we can use the curve of number of concordantly differentially expressed genes (True Positives) against number of claimed ones (Claimed Positives) to evaluate the performance of our method. After many (50) rounds of simulations (it takes a long time for each round due to the parametric bootstrap procedure with the E-M algorithm based estimation), we can take the average to obtain a smooth mean curve. Since the existing data integration methods [9–14] do not consider the issue of genome-wide concordance/discordance, they are not included for comparison in this study. However, we have compared our method with the simple pooling approach: observations of the same gene from two data sets are simply combined for each sample group, and the t-test is applied to each gene in the pooled data set. This approach is feasible since the measurements in two simulated data sets are comparable. (Then, this approach is a desired efficient approach when two data sets are genome-wide concordant. It is interesting to understand its loss of power when two sets are not genome-wide concordant.)
A case study of partial concordance/discordance
Since NOD mouse spontaneously develops type 1 diabetes, it has been widely used for studying the disease. Based on a time course microarray study using samples collected from 3 weeks to 10 weeks, week 5 is a key checkpoint for the development of type 1 diabetes . To distinguish the genes related to diabetes development from the genes related to aging, two other data sets have also been collected for two congenic strains: NOD.Idd3/Idd10 and NOD.B10Sn-H2, which do not spontaneously develop diabetes. Samples have been collected at different time points from 3 weeks to 10 weeks . Although these two strains do not spontaneously develop type 1 diabetes, it is still interesting to understand their differential expression before 5 weeks vs. after 5 weeks. Furthermore, understanding genes with concordant/discordant behavior for these two strains is important. Therefore, the data set collected for each congenic strain is partitioned into two sample groups: for strain NOD.Idd3/Idd10, there are 11 and 13 subjects collected before 5 weeks and after 5 weeks, respectively; for strain NOD.B10Sn-H2, there are 22 and 10 subjects collected before 5 weeks and after 5 weeks, respectively. Measurements for 11,424 genes have been collected based on a cDNA microarray platform.
Application results. The parametric bootstrap based null quantiles are used to evaluate the significance (p-values) of the tests of complete discordance and complete concordance between two NOD mouse data sets.
Quantile under Null
Complete Discordance (TCD)
Complete Concordance (TCC)
Application results. The parameters in the PCD model for two NOD mouse data sets are estimated through an E-M algorithm.
Data Set Two (NOD.B10Sn-H2)
Data Set One
Figure 4c shows the estimated false discovery rates. Compared to the analysis based on individual data sets, the PCD model based false positive control is not necessarily better since we intend to detect these concordantly differentially expressed genes in the integrative data analysis. Furthermore, it is important that an appropriate model must be used for the data integration. Figure 4c also shows that the CC model based analysis results can be seriously misleading. Therefore, the tests of complete concordance and complete discordance are crucial before the data integration can be considered. Figure 4d–f compare the gene ranks based on the PCD model based integration and two individual data sets. The gene ranks based on two individual data sets are quite discordant: the Spearman's rank correlation is just 0.36. The integration based gene ranks and these based on two individual data sets are quite concordant: the Spearman's rank correlations are 0.78 and 0.72, respectively.
A case study of complete concordance
In practice, we may collect gene expression data for the same study from different laboratories based on different microarray platforms. These data usually cannot be directly combined for an analysis with a larger sample size. Our method can also be used to solve this problem. Although this situation has been discussed in our simulation study, it is still necessary to illustrate it with experimental data. Here, we generate a case of complete concordance based on an experimental data set. The data set was collected for a prostate cancer study . Genome-wide expression profiles for 6034 genes (after data preprocessing) have been measured for 50 normal and 52 cancerous subjects. We randomly split this data set into two subsets with equal sample sizes (25 normal and 26 cancerous subjects).
Application results. The parametric bootstrap based null quantiles are used to evaluate the significance (p-values) of the tests of complete discordance and complete concordance between two NOD mouse data sets.
Quantile under Null
Complete Discordance (TCD)
Complete Concordance (TCC)
Application results. The parameters in the CC model for two prostate cancer subsets are estimated through an E-M algorithm.
Figure 5c shows the estimated false discovery rates. Compared to the analysis based on individual data sets, the CC model based false positive control shows a clear improvement. However, the false positive control based on the original data (subsets 1 and 2 pooled together) is the best. This is consistent with our simulation results. Figure 5d–f compare the gene ranks based on the CC model based integration and two individual subsets. The gene ranks based on two individual subsets are quite discordant: the Spearman's rank correlation is just 0.50. The integration based gene ranks and these based on two individual data sets are quite concordant: the Spearman's rank correlations are both 0.81. Furthermore, the integration based gene ranks are highly concordant with these based on the original data (result not shown): the Spearman's rank correlation is 0.96.
In this study, we have proposed a statistical framework for integrating two microarray gene expression data sets in differential expression analysis. Our simulation and application results confirm that it is necessary to evaluate the genome-wide concordance before the consideration of data integration. Otherwise, misleading results can be generated from the integrative analysis. Our current study focuses on the integration of two data sets with two-sample groups. In our future study, we will generalize our method for multiple data sets. However, it is less straightforward to generalize our method for multi-sample groups since it is difficult to define the concordance/discordance for multiple groups.
Because of the randomness of data, we can always observe some intersection of genes selected from two data sets if the selection criterion is not stringent. This is the case even when two data sets are completed unrelated. (If the selection criterion is stringent, then we may always observe a null intersection even when two data sets are actually related.) Therefore, the genome-wide concordance/discordance is a critical issue in the integrative analysis microarray data. The traditional hyper-geometric analysis relies on the criterion of gene selection, which can be quite arbitrary in practice. For example, the results based on the threshold of 5%, 10% or 20% false discovery rates can be considerably different. It is not a rigorous approach to address the genome-wide concordance/discordance. In a recent study , it has also been shown that the widely used overlap count (or Venn diagrams) is not an appropriate metric for measuring the reproducibility of differential expression analysis. Furthermore, it is not clear how to rank genes efficiently in the intersection of genes selected from two data sets.
To our knowledge, there is no other existing methods for evaluating genome-wide concordance/discordance before the consideration of data integration. Our mixture model based approach is simple and intuitive. There are usually 3 major gene groups in a data set: up-regulated, down-regulated and null genes, which correspond to the three components in our model. The model inference is well-developed in the field of statistics. Furthermore, our model allows us to provide rigorous ranks for genes analyzed in two data sets. In our simulation study, our method can still provide a comparable performance in the situation of complete genome-wide concordance when the ideal pooling approach is feasible. If two data sets are not completely concordant, then our method will provide a better performance.
Our method has several advantages. It allows us to test genome-wide concordance/discordance, which is a critical issue before the data integration can be considered. It is a likelihood-based approach, which is efficient when the underlying model is not seriously mis-specified. We have also showed the robustness of our method through a simulation study when the underlying models are somewhat inconsistent. Furthermore, the data integration is achieved through a rigorously defined probability with close formulas.
Our method also has the following disadvantages. It is difficult to validate the assumed mixture model. However, without this assumption, we currently have no effect approach for evaluating genome-wide concordance/discordance. Furthermore, the calculation of likelihood assumes that the test scores from different genes are independent. However, it is well-known that the covariance structure of a microarray gene expression data set can be complicated. In our future study, we will explore more efficient approaches to overcome these disadvantages.
The R-code is freely available at the author's website . This work was partially supported by NIH grants DK-75004 (Y. Lai), HD-37800 (J-X. She) and HD-50196 (J-X. She).
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1
- Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995, 270: 467-470. 10.1126/science.270.5235.467.View ArticlePubMedGoogle Scholar
- Lockhart D, Dong H, Byrne M, Follettie M, Gallo M, Chee M, Mittmann M, Wang C, Kobayashi M, Horton H, Brown E: Expression monitoring by hybridization to high-density oligonuleotide arrays. Nature Biotechnology. 1996, 14: 1675-1680. 10.1038/nbt1296-1675.View ArticlePubMedGoogle Scholar
- Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell. 1998, 9: 3273-3297.PubMed CentralView ArticlePubMedGoogle Scholar
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286: 531-537. 10.1126/science.286.5439.531.View ArticlePubMedGoogle Scholar
- Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D'Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR: Gene expression correlates of clinical prostate cancer behavior. Cancer Cell. 2002, 1: 203-209. 10.1016/S1535-6108(02)00030-2.View ArticlePubMedGoogle Scholar
- Wilson KHS, Eckenrode SE, Li QZ, Ruan QG, Yang P, Shi JD, Davoodi-Semiromi A, Mclndoe RA, Croker BP, She JX: Microarray analysis of gene expression in the kidneys of new- and post-onset diabetic NOD mice. Diabetes. 2003, 52: 2151-2159. 10.2337/diabetes.52.8.2151.View ArticlePubMedGoogle Scholar
- Eckenrode SE, Ruan Q, Yang P, Zheng W, Mclndoe RA, She JX: Gene expression profiles define a key checkpoint for type 1 diabetes in NOD mice. Diabetes. 2004, 53: 366-375. 10.2337/diabetes.53.2.366.View ArticlePubMedGoogle Scholar
- Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop L: PGC-1α-response genes involved in oxidative phos-phorylation are coordinately downregulated in human diabetes. Nature Genetics. 2003, 34: 267-273. 10.1038/ng1180.View ArticlePubMedGoogle Scholar
- Choi JK, Yu U, Kim S, Yoo OJ: Combining multiple microarray studies and modeling interstudy variation. Bioinformatics. 2003, 19 (Supplement 1): i84-90. 10.1093/bioinformatics/btg1010.View ArticlePubMedGoogle Scholar
- Xu L, Tan AC, Naiman DQ, Geman D, Winslow RL: Robust prostate cancer marker genes emerge from direct integration of inter-study microarray data. Bioinformatics. 2005, 21: 3905-3911. 10.1093/bioinformatics/bti647.View ArticlePubMedGoogle Scholar
- Conlon EM, Song JJ, Liu JS: Bayesian models for pooling microarray studies with multiple sources of replications. BMC Bioinformatics. 2006, 7: 247-10.1186/1471-2105-7-247.PubMed CentralView ArticlePubMedGoogle Scholar
- Hong F, R RB: A comparison of meta-analysis methods for detecting differentially expressed genes in microarray experiments. Bioinformatics. 2008, 24: 374-382. 10.1093/bioinformatics/btm620.View ArticlePubMedGoogle Scholar
- Xu L, Tan AC, Winslow RL, Geman D: Merging microarray data from separate breast cancer studies provides a robust prognostic test. BMC Bioinformatics. 2008, 9: 125-10.1186/1471-2105-9-125.PubMed CentralView ArticlePubMedGoogle Scholar
- Borozan I, Chen L, Paeper B, Heathcote JE, Edwards AM, Katze M, Zhang Z, McGilvray ID: MAID: An effect size based model for microarray data integration across laboratories and platforms. BMC Bioinformatics. 2008, 9: 305-10.1186/1471-2105-9-305.PubMed CentralView ArticlePubMedGoogle Scholar
- Cahan P, Ahmad AM, Burke H, Fu S, Lai Y, Florea L, Dharker N, Kobrinski T, Kale P, McCaffrey TA: List of lists-annotated (LOLA): a database for annotation and comparison of published microarray gene lists. Gene. 2005, 360: 78-82. 10.1016/j.gene.2005.07.008.View ArticlePubMedGoogle Scholar
- Ein-Dor L, Zuk O, Domany E: Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proceedings of the National Academy of Sciences USA. 2006, 103: 5923-5928. 10.1073/pnas.0601231103.View ArticleGoogle Scholar
- Miron M, Woody OZ, Marcil A, Murie C, Sladek R, Nadon R: A methodology for global validation of microarray experiments. BMC Bioinformatics. 2006, 7: 333-10.1186/1471-2105-7-333.PubMed CentralView ArticlePubMedGoogle Scholar
- Lai Y, Adam BL, Podolsky R, She JX: A mixture model approach to the tests of concordance and discordance between two large scale experiments with two-sample groups. Bioinformatics. 2007, 23: 1243-1250. 10.1093/bioinformatics/btm103.View ArticlePubMedGoogle Scholar
- Zhang M, Yao C, Guo Z, Zou J, Zhang L, Xiao H, Wang D, Yang D, Gong X, Zhu J, Li Y, Li X: Apparently low reproducibility of true differential expression discoveries in microarray studies. Bioinformatics. 2008,Google Scholar
- Cui X, Hwang JTG, Qiu J, Blades NJ, Churchill GA: Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics. 2005, 6: 59-75. 10.1093/biostatistics/kxh018.View ArticlePubMedGoogle Scholar
- Dudoit S, Shaffer JP, Boldrick JC: Multiple hypothesis testing in microarray experiments. Statistical Science. 2003, 18: 71-103. 10.1214/ss/1056397487.View ArticleGoogle Scholar
- McLachlan GJ, Bean RW, Jones LB: A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics. 2006, 22: 1608-1615. 10.1093/bioinformatics/btl148.View ArticlePubMedGoogle Scholar
- McLachlan GJ, Krishnan T: The EM algorithm and extensions. 1997, John Wiley & Sons, IncGoogle Scholar
- McLachlan GJ: On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. Applied Statistics. 1987, 36: 318-324. 10.2307/2347790.View ArticleGoogle Scholar
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B. 1995, 57: 289-300.Google Scholar
- Storey JD, Tibshirani R: Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, USA. 2003, 100: 9440-9445. 10.1073/pnas.1530509100.View ArticleGoogle Scholar
- Web link for R-code. [http://home.gwu.edu/~ylai/research/Concordance]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.