 Research
 Open Access
 Published:
A statistical framework for integrating two microarray data sets in differential expression analysis
BMC Bioinformatics volume 10, Article number: S23 (2009)
Abstract
Background
Different microarray data sets can be collected for studying the same or similar diseases. We expect to achieve a more efficient analysis of differential expression if an efficient statistical method can be developed for integrating different microarray data sets. Although many statistical methods have been proposed for data integration, the genomewide concordance of different data sets has not been well considered in the analysis.
Results
Before considering data integration, it is necessary to evaluate the genomewide concordance so that misleading results can be avoided. Based on the test results, different subsequent actions are suggested. The evaluation of genomewide concordance and the data integration can be achieved based on the normal distribution based mixture models.
Conclusion
The results from our simulation study suggest that misleading results can be generated if the genomewide concordance issue is not appropriately considered. Our method provides a rigorous parametric solution. The results also show that our method is robust to certain model misspecification and is practically useful for the integrative analysis of differential expression.
Background
Microarray is an experimental method by which tens of thousands of genes can be printed on a small chip and their expression can be measured simultaneously [1, 2]. Since the microarray technology was introduced, it has been widely used in many biomedical studies [3, 4]. Microarrays can be used to measure expression for tens of thousands of genes at the mRNA level for samples in normal and disease groups, and then statistical methods for twosample comparison can be used to identify differentially expressed genes. Differentially expressed genes are potential disease related genes for clinical diagnoses and medical treatments. This approach has been successfully used in cancer studies [4, 5] as well as diabetes studies [6, 7].
Although microarray technology has been developed for more than a decade, the experiment cost is still considerably expensive. This limits the sample size of microarray studies. Therefore, the detection power can be low, especially when the signal of differential expression is relatively weak [8]. Many microarray data sets have been collected for the same or similar research purpose. Detecting genes with concordant behavior among different data sets is of biological interest. It is also of statistical interest to improve the detection power if it is feasible to integrate different data sets in differential expression analysis. For this reason, several methods have been proposed for data integration [9–14].
However, the genomewide concordance of different data sets has not been well considered in these integrative analyses. A gene selected for the followup analysis should behave concordantly in different data sets. For example, if a gene is upregulated in one experiment, then it should also be upregulated in another experiment. Slight inconsistency should be expected since there are considerable noises generated by microarray experiments. If two data sets are genomewide concordant, then integrating them can generally improve the sample size and reduce the noise impact. Therefore, it is desirable to combine observations of concordant genes since we expect to achieve a more powerful detection of differential expression. However, if two data sets are not genomewide concordant, then there are genes with discordant behavior in different data sets. There are many possible factors for such observations, such as population heterogeneity, probe binding issues from different microarray platforms, as well as labspecific system noises. Therefore, integrating observations of discordant genes may result in misleading conclusions and should be discouraged.
When a seemingly discordant behavior is observed for a gene, it is difficult to tell whether the observation is generated by random noises or the observation reflects the underlying truth. Therefore, it is not trivial to determine whether a gene has a concordant/discordant behavior in different experiments. The analysis will be more complicated for evaluating genomewide concordance. Cahan et al. [15] have studied different gene lists identified from different data sets. EinDor et al. [16] have showed that we may need to collect thousands of samples to generate a robust gene list for disease prediction. Miron et al. [17] have proposed a correlation based approach for measuring concordance between two lists of test statistics from two data sets. However, this approach does not consider the fact that different genes in a data set belong to different components (nondifferentially expressed, upregulated, downregulated, etc.). We have recently proposed a mixturemodel based method for testing genomewide concordance and discordance [18]. This approach considers the mixture of different gene components as well as the independence between two data sets, and it can be extended for data integration. In this study, we propose a mixturemodel based statistical framework to achieve a rigorous integrative analysis of differential expression.
In a recent study [19], it has been shown that the widely used overlap count (or Venn diagrams) is not an appropriate metric for measuring the reproducibility of differential expression analysis. It is necessary to develop new metrics for rigorously measuring the reproducibility of differential expression analysis. The disadvantage of overlap count metric is that the randomness of differential expression measures (e.g. ttest) has not been well considered. However, our mixturemodel based tests of genomewide concordance and discordance [18] take this randomness into account and the reported pvalues can be used as rigorous metrics for measuring the reproducibility of differential expression analysis.
For the rest of paper, we first introduce our statistical framework. Then, we use simulated data to evaluate its performance. Two experimental data based case studies are considered as the applications. Finally, we discuss the advantages and disadvantages of our method.
Methods
A statistical framework
Figure 1 provides an illustrative flow chat for our statistical framework. The integration of two microarray gene expression data sets is considered so that we can achieve a more powerful detection of concordantly differentially expressed genes. Here, we assume that two data sets have been preprocessed so that they contain the same gene list. The framework can be summarized as the procedure below. Then, we describe the detail of each step. (See our recent publication [18] for the technical detail of tests of complete concordance and complete discordance.)

1.
In each data set, perform a statistical test of differential expression for each gene to obtain a lists of test scores;

2.
For each list of test scores, perform a transformation procedure to obtain a list of zscores;

3.
For two lists of zscores, test the complete discordance between them;

(a)
if the complete discordance cannot be rejected, then the data integration will be discouraged;

(b)
if the complete discordance can be rejected, then continue to the next step;

4.
For two lists of zscores, test the complete concordance between them;

(a)
if the complete concordance cannot be rejected, then calculate a list of concordant integrative scores based on the complete concordance (CC) model;

(b)
if the complete concordance can be rejected, then calculate a list of concordant integrative scores based on the partial concordance/discordance (PCD) model;

5.
Use the list of concordant integrative scores to prioritize genes for the followup study.
Test of differential expression
For simplicity, we consider the Student's twosample ttest for differential expression analysis. Other test statistics, such as Wilcoxon's rank sum test or a generalized t/Fstatistic [20], can certainly be considered. The statistical significance (pvalue) of a test value can be evaluated based on either its theoretical null distribution or a permutation null distribution [21]. In this study, the theoretical pvalue is used for the simulation study since we know the underlying distribution; the permutation pvalue is used for the application since the underlying distribution is unknown (B = 500 is used as the number of permutations).
Transformation of test score
It has been suggested transforming a test value to its associated zscore so that more efficient results can be achieved in a normal mixture model based analysis [22]. When the onesided (uppertailed) pvalue of a test value is available, the associated zscore can be simply calculated by
z = Φ^{1}(1  p),
where Φ^{1}(·) is the inverse of standard normal distribution. Notice that it is necessary to use onesided pvalues since we intend to distinguish upregulated differential expression from downregulated differential expression.
Mixture models
We have proposed several mixture models [18] to evaluate the genomewide concordance/discordance between two lists of zscores: {(z_{1k}, z_{2k}) : k = 1, 2, ..., m}, where m is the number of common genes in both data sets. A general mixture model can be used to represent the case of partial concordance/discordance (PCD):
This model can be reduced to a complete concordance (CC) model:
and a complete discordance (CD) model:
More details for these models have been described in our recent publication [18]. In these models, index 0 is used to represent the null component with fixed parameters: μ_{0} = ν_{0} = 0 and ${\sigma}_{0}^{2}={\tau}_{0}^{2}=1$; indices 1 and 2 are used to represent the downregulated and upregulated components with constrains: μ_{1}, ν_{1} ≤ 0 and μ_{2}, ν_{2} ≥ 0; π_{ ij }is the proportion of genes belonging to the ith component in the first data set and jth component in the second data set (Σ_{ ij }π_{ ij }= 1). π_{i.}is the marginal proportion of genes belonging to the ith component in the first data set; and π._{ j }is the marginal proportion of genes belonging to the jth component in the second data set. The model parameters can be estimated through an EM algorithm [23]. The detail has also been described in our recent publication [18].
Tests of concordance and discordance
Based on the assumption of independence among the list of zscores, we can calculate the mixture model based likelihoods:
With these likelihoods, we can test PCD (H_{1}) against CC (H_{0}) or CD (H_{0}) by the following likelihood ratio tests in the logarithm scale:
The statistical significance of a test value can be evaluated by the parametric bootstrap procedure [24], which has also been described in our recent publication [18].
Data integration
If the complete discordance (CD) cannot be rejected, then the data integration will be discouraged to avoid misleading results. If CD can be rejected, then either the complete concordance (CC) or the partial concordance/discordance (PCD) will be established. Although CC is a special case of PCD, it is still statistically necessary to test PCD against CC. If CC cannot be rejected, then we expect to achieve a more efficient data integration by reducing the number of parameters. Under CC or PCD, it is feasible to consider the data integration. To prioritizing genes, we can consider a concordant integrative score, which is the conditional probability of concordantly differential expression under an appropriate mixture model: [P(observed pair of zscores both upregulated) + P(observed pair of zscores both downregulated)]/P(observed pair of zscores). Under the CC model, it is calculated as:
Under the PCD model, it is calculated as:
False positive control
The mixture model provides a rigorous convenience to estimate the number of false positives in a theoretical manner. Notice that the above concordant integrative score is actually a probability of true positive. Therefore, if we are interested in the top K genes {X_{(1)}, X_{(2)}, ..., X_{(K)}} ranked by the concordant integrative score, then the associated number of false positives can be estimated as:
where S(·) is calculated based on an appropriate mixture model (CC or PCD). With this estimate, one may realize that the false discovery rate [25] (FDR) based on the qvalue concept [26] can be simply estimated as $\widehat{FP}$/K.
For an individual data set, we simply use the Rpackage qvalue [26] to obtain the estimated FDR. Then, the number of false positives for the top K genes can be simple estimated as K × FDR(K), which is theoretically consistent with the above $\widehat{FP}$.
Results and discussion
Illustrative examples
Figure 2 shows three examples to illustrate the concepts of concordance and discordance. These examples are simulated based on the simulation configuration in the next subsection. The proportions of genes with discordant behavior in two data sets are ξ = 0%, 50% and 100%, respectively. Therefore, these are representative examples for the cases of complete concordance, partial concordance/discordance and complete discordance. The corresponding Pearson's correlation coefficients are 49.2%, 36.1% and 2.1%. Therefore, the correlation measure is not an appropriate metric to tell whether two data sets are concordant or not. One may also realize that the overlap count is neither a rigorous approach even for the first example (a case of complete concordance).
A simulation study
There are many parameters to be considered when we simulate microarray gene expression data:

Gene size m;

The proportion of nondifferentially expressed genes π_{0};

Sample sizes of two groups n_{1}, n_{2};

Distributions of expression measurements of differentially and nondifferentially expressed genes;

Covariance structure among genes.
In our simulation studies, we consider the widely used block structure: genes are partitioned into many blocks; genes within the same block are positively dependent; and different blocks are independent. To save the computing time, we reasonably set gene size m = 6000, π_{0} = 80% and n_{1} = n_{2} = 15 for each of two data sets. The block size (number of genes in each block) is set b = 25. Within each block, the expression measurements are simulated from a multivariate normal distribution. For blocks of nondifferentially expressed genes, we simulate expression measurements from N(${\widehat{\mu}}_{0}$, Σ_{0}), where ${\widehat{\mu}}_{0}$ is a b × 1 vector of 0's and Σ_{0} is a b × b matrix with diagonal entries as 1 and nondiagonal entries as a fixed value (simulated from a Uniform distribution U[0.5, 0.9]). For blocks of differentially expressed genes, we simulate expression measurements from N(${\widehat{\mu}}_{1}$, Σ_{1}) and N(${\widehat{\mu}}_{2}$, Σ_{2}) for the first and the second sample groups, respectively. ${\widehat{\mu}}_{2}$ is simply a b × 1 vector of 0's. To simulate ${\widehat{\mu}}_{1}$, we first simulate b × 1 vector of random numbers from a Beta distribution Beta(1.5, 1.5), multiply this vector by a factor r = 1.5, and then multiply randomly simulated signs (50% positive and 50% negative) so that both up and down regulated differential expression can be generated. Σ_{1} and Σ_{2} are similarly generated as Σ_{0}. Two data sets are first simulated based on the same configuration. To simulate genes with discordant behavior, we randomly reallocate ξ = 0%, 15%, 30%, 45%, 60% and 75% genes in the second data set so that these genes are no longer matched with those in the first data set. Notice that the simulation configuration is not completely consistent with what have been assumed for our method. This is intentionally designed to understand the robustness of our method. The complete concordance or the complete discordant will be rejected at the level pvalue < 0.025 since there is a issue of multiple hypothesis testing. To save computing time, we only perform the parametric bootstrap for 100 times to evaluate the pvalue of a test.
For each round of simulation, since we know the truth (simulation configuration), we can use the curve of number of concordantly differentially expressed genes (True Positives) against number of claimed ones (Claimed Positives) to evaluate the performance of our method. After many (50) rounds of simulations (it takes a long time for each round due to the parametric bootstrap procedure with the EM algorithm based estimation), we can take the average to obtain a smooth mean curve. Since the existing data integration methods [9–14] do not consider the issue of genomewide concordance/discordance, they are not included for comparison in this study. However, we have compared our method with the simple pooling approach: observations of the same gene from two data sets are simply combined for each sample group, and the ttest is applied to each gene in the pooled data set. This approach is feasible since the measurements in two simulated data sets are comparable. (Then, this approach is a desired efficient approach when two data sets are genomewide concordant. It is interesting to understand its loss of power when two sets are not genomewide concordant.)
Figure 3 shows the comparison between our method and the pooling approach. When ξ = 0% (complete concordance), the performance of two approaches are still comparable. (Notice that in such a situation, the pooling approach is an ideal choice.) Our tests of complete concordance (CC) and complete discordance (CD) result in 45 CC and 5 partial concordance/discordance (PCD) among 50 repetitions. The advantage of our method becomes clearer and clearer when ξ is increased from 15% to 75% (partial concordance/discordance): our method is clearly better when the number of claimed positives is within 1 to 1000 (notice that relatively lowly ranked genes are of less interest in microarray studies). Our tests of CC and CD result in 50 PCD among 50 repetitions for all these 5 configurations. In addition to the practical usefulness of our method, Figure 3 also confirms that it is importance to evaluate the genomewide concordance before the data integration can be considered. Otherwise, we may obtain seriously misleading analysis results.
Applications
A case study of partial concordance/discordance
Since NOD mouse spontaneously develops type 1 diabetes, it has been widely used for studying the disease. Based on a time course microarray study using samples collected from 3 weeks to 10 weeks, week 5 is a key checkpoint for the development of type 1 diabetes [7]. To distinguish the genes related to diabetes development from the genes related to aging, two other data sets have also been collected for two congenic strains: NOD.Idd3/Idd10 and NOD.B10SnH2, which do not spontaneously develop diabetes. Samples have been collected at different time points from 3 weeks to 10 weeks [7]. Although these two strains do not spontaneously develop type 1 diabetes, it is still interesting to understand their differential expression before 5 weeks vs. after 5 weeks. Furthermore, understanding genes with concordant/discordant behavior for these two strains is important. Therefore, the data set collected for each congenic strain is partitioned into two sample groups: for strain NOD.Idd3/Idd10, there are 11 and 13 subjects collected before 5 weeks and after 5 weeks, respectively; for strain NOD.B10SnH2, there are 22 and 10 subjects collected before 5 weeks and after 5 weeks, respectively. Measurements for 11,424 genes have been collected based on a cDNA microarray platform.
Figures 4a and 4b show the scatterplot for the paired pvalues (based on 500 permutations) and zscores of 11,424 genes based on these two data sets. It is difficult to evaluate the genomewide concordance/discordance based on the scatterplot of paired pvalues (Figure 4a). From the scatterplot of paired zscores (Figure 4b), the genomewide concordance seems quite satisfactory. However, based on 1000 parametric bootstraps (Table 1), the tests of complete concordance (CC) and complete discordance (CD) are both significant (p < 0.01). Therefore, both CC and CD models are rejected and the partial concordance model (PCD) should be used in the analysis. This is not surprising since certain genetic and biological differences are expected from these two similar strains. Table 2 gives the PCD model estimates. There are still about 75% genes with concordant behavior. The level of darkness in Figure 4b represents the level of being concordantly differentially expressed that is evaluated by the concordant integrative score based on the PCD model (we set 0.2 for the smallest darkness level so that these nondifferentially expressed genes can be visualized).
Figure 4c shows the estimated false discovery rates. Compared to the analysis based on individual data sets, the PCD model based false positive control is not necessarily better since we intend to detect these concordantly differentially expressed genes in the integrative data analysis. Furthermore, it is important that an appropriate model must be used for the data integration. Figure 4c also shows that the CC model based analysis results can be seriously misleading. Therefore, the tests of complete concordance and complete discordance are crucial before the data integration can be considered. Figure 4d–f compare the gene ranks based on the PCD model based integration and two individual data sets. The gene ranks based on two individual data sets are quite discordant: the Spearman's rank correlation is just 0.36. The integration based gene ranks and these based on two individual data sets are quite concordant: the Spearman's rank correlations are 0.78 and 0.72, respectively.
A case study of complete concordance
In practice, we may collect gene expression data for the same study from different laboratories based on different microarray platforms. These data usually cannot be directly combined for an analysis with a larger sample size. Our method can also be used to solve this problem. Although this situation has been discussed in our simulation study, it is still necessary to illustrate it with experimental data. Here, we generate a case of complete concordance based on an experimental data set. The data set was collected for a prostate cancer study [5]. Genomewide expression profiles for 6034 genes (after data preprocessing) have been measured for 50 normal and 52 cancerous subjects. We randomly split this data set into two subsets with equal sample sizes (25 normal and 26 cancerous subjects).
Figures 5a and 5b show the scatterplot for the paired pvalues (based on 500 permutations) and zscores of 6,034 genes based on these two subsets. They are highly genomewide concordant. This is consistent with our expectation. Based on 1000 parametric bootstraps (Table 3), the complete discordance is rejected (p < 0.01) but the complete concordance cannot be rejected (the associated pvalue is highly insignificant). Table 4 gives the CC model estimates. The estimates of mean and variance parameters in two subsets are consistent. The level of darkness in Figure 5b represents the level of being concordantly differentially expressed that is evaluated by the concordant integrative score based on the CC model (0.2 is set for the smallest darkness level so that these nondifferentially expressed genes can be visualized).
Figure 5c shows the estimated false discovery rates. Compared to the analysis based on individual data sets, the CC model based false positive control shows a clear improvement. However, the false positive control based on the original data (subsets 1 and 2 pooled together) is the best. This is consistent with our simulation results. Figure 5d–f compare the gene ranks based on the CC model based integration and two individual subsets. The gene ranks based on two individual subsets are quite discordant: the Spearman's rank correlation is just 0.50. The integration based gene ranks and these based on two individual data sets are quite concordant: the Spearman's rank correlations are both 0.81. Furthermore, the integration based gene ranks are highly concordant with these based on the original data (result not shown): the Spearman's rank correlation is 0.96.
Conclusion
In this study, we have proposed a statistical framework for integrating two microarray gene expression data sets in differential expression analysis. Our simulation and application results confirm that it is necessary to evaluate the genomewide concordance before the consideration of data integration. Otherwise, misleading results can be generated from the integrative analysis. Our current study focuses on the integration of two data sets with twosample groups. In our future study, we will generalize our method for multiple data sets. However, it is less straightforward to generalize our method for multisample groups since it is difficult to define the concordance/discordance for multiple groups.
Because of the randomness of data, we can always observe some intersection of genes selected from two data sets if the selection criterion is not stringent. This is the case even when two data sets are completed unrelated. (If the selection criterion is stringent, then we may always observe a null intersection even when two data sets are actually related.) Therefore, the genomewide concordance/discordance is a critical issue in the integrative analysis microarray data. The traditional hypergeometric analysis relies on the criterion of gene selection, which can be quite arbitrary in practice. For example, the results based on the threshold of 5%, 10% or 20% false discovery rates can be considerably different. It is not a rigorous approach to address the genomewide concordance/discordance. In a recent study [19], it has also been shown that the widely used overlap count (or Venn diagrams) is not an appropriate metric for measuring the reproducibility of differential expression analysis. Furthermore, it is not clear how to rank genes efficiently in the intersection of genes selected from two data sets.
To our knowledge, there is no other existing methods for evaluating genomewide concordance/discordance before the consideration of data integration. Our mixture model based approach is simple and intuitive. There are usually 3 major gene groups in a data set: upregulated, downregulated and null genes, which correspond to the three components in our model. The model inference is welldeveloped in the field of statistics. Furthermore, our model allows us to provide rigorous ranks for genes analyzed in two data sets. In our simulation study, our method can still provide a comparable performance in the situation of complete genomewide concordance when the ideal pooling approach is feasible. If two data sets are not completely concordant, then our method will provide a better performance.
Our method has several advantages. It allows us to test genomewide concordance/discordance, which is a critical issue before the data integration can be considered. It is a likelihoodbased approach, which is efficient when the underlying model is not seriously misspecified. We have also showed the robustness of our method through a simulation study when the underlying models are somewhat inconsistent. Furthermore, the data integration is achieved through a rigorously defined probability with close formulas.
Our method also has the following disadvantages. It is difficult to validate the assumed mixture model. However, without this assumption, we currently have no effect approach for evaluating genomewide concordance/discordance. Furthermore, the calculation of likelihood assumes that the test scores from different genes are independent. However, it is wellknown that the covariance structure of a microarray gene expression data set can be complicated. In our future study, we will explore more efficient approaches to overcome these disadvantages.
References
 1.
Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995, 270: 467470. 10.1126/science.270.5235.467.
 2.
Lockhart D, Dong H, Byrne M, Follettie M, Gallo M, Chee M, Mittmann M, Wang C, Kobayashi M, Horton H, Brown E: Expression monitoring by hybridization to highdensity oligonuleotide arrays. Nature Biotechnology. 1996, 14: 16751680. 10.1038/nbt12961675.
 3.
Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycleregulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell. 1998, 9: 32733297.
 4.
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286: 531537. 10.1126/science.286.5439.531.
 5.
Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D'Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR: Gene expression correlates of clinical prostate cancer behavior. Cancer Cell. 2002, 1: 203209. 10.1016/S15356108(02)000302.
 6.
Wilson KHS, Eckenrode SE, Li QZ, Ruan QG, Yang P, Shi JD, DavoodiSemiromi A, Mclndoe RA, Croker BP, She JX: Microarray analysis of gene expression in the kidneys of new and postonset diabetic NOD mice. Diabetes. 2003, 52: 21512159. 10.2337/diabetes.52.8.2151.
 7.
Eckenrode SE, Ruan Q, Yang P, Zheng W, Mclndoe RA, She JX: Gene expression profiles define a key checkpoint for type 1 diabetes in NOD mice. Diabetes. 2004, 53: 366375. 10.2337/diabetes.53.2.366.
 8.
Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop L: PGC1αresponse genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics. 2003, 34: 267273. 10.1038/ng1180.
 9.
Choi JK, Yu U, Kim S, Yoo OJ: Combining multiple microarray studies and modeling interstudy variation. Bioinformatics. 2003, 19 (Supplement 1): i8490. 10.1093/bioinformatics/btg1010.
 10.
Xu L, Tan AC, Naiman DQ, Geman D, Winslow RL: Robust prostate cancer marker genes emerge from direct integration of interstudy microarray data. Bioinformatics. 2005, 21: 39053911. 10.1093/bioinformatics/bti647.
 11.
Conlon EM, Song JJ, Liu JS: Bayesian models for pooling microarray studies with multiple sources of replications. BMC Bioinformatics. 2006, 7: 24710.1186/147121057247.
 12.
Hong F, R RB: A comparison of metaanalysis methods for detecting differentially expressed genes in microarray experiments. Bioinformatics. 2008, 24: 374382. 10.1093/bioinformatics/btm620.
 13.
Xu L, Tan AC, Winslow RL, Geman D: Merging microarray data from separate breast cancer studies provides a robust prognostic test. BMC Bioinformatics. 2008, 9: 12510.1186/147121059125.
 14.
Borozan I, Chen L, Paeper B, Heathcote JE, Edwards AM, Katze M, Zhang Z, McGilvray ID: MAID: An effect size based model for microarray data integration across laboratories and platforms. BMC Bioinformatics. 2008, 9: 30510.1186/147121059305.
 15.
Cahan P, Ahmad AM, Burke H, Fu S, Lai Y, Florea L, Dharker N, Kobrinski T, Kale P, McCaffrey TA: List of listsannotated (LOLA): a database for annotation and comparison of published microarray gene lists. Gene. 2005, 360: 7882. 10.1016/j.gene.2005.07.008.
 16.
EinDor L, Zuk O, Domany E: Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proceedings of the National Academy of Sciences USA. 2006, 103: 59235928. 10.1073/pnas.0601231103.
 17.
Miron M, Woody OZ, Marcil A, Murie C, Sladek R, Nadon R: A methodology for global validation of microarray experiments. BMC Bioinformatics. 2006, 7: 33310.1186/147121057333.
 18.
Lai Y, Adam BL, Podolsky R, She JX: A mixture model approach to the tests of concordance and discordance between two large scale experiments with twosample groups. Bioinformatics. 2007, 23: 12431250. 10.1093/bioinformatics/btm103.
 19.
Zhang M, Yao C, Guo Z, Zou J, Zhang L, Xiao H, Wang D, Yang D, Gong X, Zhu J, Li Y, Li X: Apparently low reproducibility of true differential expression discoveries in microarray studies. Bioinformatics. 2008,
 20.
Cui X, Hwang JTG, Qiu J, Blades NJ, Churchill GA: Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics. 2005, 6: 5975. 10.1093/biostatistics/kxh018.
 21.
Dudoit S, Shaffer JP, Boldrick JC: Multiple hypothesis testing in microarray experiments. Statistical Science. 2003, 18: 71103. 10.1214/ss/1056397487.
 22.
McLachlan GJ, Bean RW, Jones LB: A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics. 2006, 22: 16081615. 10.1093/bioinformatics/btl148.
 23.
McLachlan GJ, Krishnan T: The EM algorithm and extensions. 1997, John Wiley & Sons, Inc
 24.
McLachlan GJ: On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. Applied Statistics. 1987, 36: 318324. 10.2307/2347790.
 25.
Benjamini Y, Hochberg Y: Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B. 1995, 57: 289300.
 26.
Storey JD, Tibshirani R: Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, USA. 2003, 100: 94409445. 10.1073/pnas.1530509100.
 27.
Web link for Rcode. [http://home.gwu.edu/~ylai/research/Concordance]
Acknowledgements
The Rcode is freely available at the author's website [27]. This work was partially supported by NIH grants DK75004 (Y. Lai), HD37800 (JX. She) and HD50196 (JX. She).
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/10?issue=S1
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
Y Lai conceived of the study, developed the methods, performed the statistical analysis, and drafted the manuscript; SE Eckenrode carried out the microarray experiments; JX She designed the microarray experiments and helped to draft the manuscript. All authors read and approved the final manuscript.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Lai, Y., Eckenrode, S.E. & She, J. A statistical framework for integrating two microarray data sets in differential expression analysis. BMC Bioinformatics 10, S23 (2009). https://doi.org/10.1186/1471210510S1S23
Published:
Keywords
 Mixture Model
 Data Integration
 Differential Expression Analysis
 Congenic Strain
 Simulation Configuration