 Research
 Open Access
A statistical framework for integrating two microarray data sets in differential expression analysis
 Yinglei Lai^{1}Email author,
 Sarah E Eckenrode^{2} and
 JinXiong She^{2}
https://doi.org/10.1186/1471210510S1S23
© Lai et al; licensee BioMed Central Ltd. 2009
 Published: 30 January 2009
Abstract
Background
Different microarray data sets can be collected for studying the same or similar diseases. We expect to achieve a more efficient analysis of differential expression if an efficient statistical method can be developed for integrating different microarray data sets. Although many statistical methods have been proposed for data integration, the genomewide concordance of different data sets has not been well considered in the analysis.
Results
Before considering data integration, it is necessary to evaluate the genomewide concordance so that misleading results can be avoided. Based on the test results, different subsequent actions are suggested. The evaluation of genomewide concordance and the data integration can be achieved based on the normal distribution based mixture models.
Conclusion
The results from our simulation study suggest that misleading results can be generated if the genomewide concordance issue is not appropriately considered. Our method provides a rigorous parametric solution. The results also show that our method is robust to certain model misspecification and is practically useful for the integrative analysis of differential expression.
Keywords
 Mixture Model
 Data Integration
 Differential Expression Analysis
 Congenic Strain
 Simulation Configuration
Background
Microarray is an experimental method by which tens of thousands of genes can be printed on a small chip and their expression can be measured simultaneously [1, 2]. Since the microarray technology was introduced, it has been widely used in many biomedical studies [3, 4]. Microarrays can be used to measure expression for tens of thousands of genes at the mRNA level for samples in normal and disease groups, and then statistical methods for twosample comparison can be used to identify differentially expressed genes. Differentially expressed genes are potential disease related genes for clinical diagnoses and medical treatments. This approach has been successfully used in cancer studies [4, 5] as well as diabetes studies [6, 7].
Although microarray technology has been developed for more than a decade, the experiment cost is still considerably expensive. This limits the sample size of microarray studies. Therefore, the detection power can be low, especially when the signal of differential expression is relatively weak [8]. Many microarray data sets have been collected for the same or similar research purpose. Detecting genes with concordant behavior among different data sets is of biological interest. It is also of statistical interest to improve the detection power if it is feasible to integrate different data sets in differential expression analysis. For this reason, several methods have been proposed for data integration [9–14].
However, the genomewide concordance of different data sets has not been well considered in these integrative analyses. A gene selected for the followup analysis should behave concordantly in different data sets. For example, if a gene is upregulated in one experiment, then it should also be upregulated in another experiment. Slight inconsistency should be expected since there are considerable noises generated by microarray experiments. If two data sets are genomewide concordant, then integrating them can generally improve the sample size and reduce the noise impact. Therefore, it is desirable to combine observations of concordant genes since we expect to achieve a more powerful detection of differential expression. However, if two data sets are not genomewide concordant, then there are genes with discordant behavior in different data sets. There are many possible factors for such observations, such as population heterogeneity, probe binding issues from different microarray platforms, as well as labspecific system noises. Therefore, integrating observations of discordant genes may result in misleading conclusions and should be discouraged.
When a seemingly discordant behavior is observed for a gene, it is difficult to tell whether the observation is generated by random noises or the observation reflects the underlying truth. Therefore, it is not trivial to determine whether a gene has a concordant/discordant behavior in different experiments. The analysis will be more complicated for evaluating genomewide concordance. Cahan et al. [15] have studied different gene lists identified from different data sets. EinDor et al. [16] have showed that we may need to collect thousands of samples to generate a robust gene list for disease prediction. Miron et al. [17] have proposed a correlation based approach for measuring concordance between two lists of test statistics from two data sets. However, this approach does not consider the fact that different genes in a data set belong to different components (nondifferentially expressed, upregulated, downregulated, etc.). We have recently proposed a mixturemodel based method for testing genomewide concordance and discordance [18]. This approach considers the mixture of different gene components as well as the independence between two data sets, and it can be extended for data integration. In this study, we propose a mixturemodel based statistical framework to achieve a rigorous integrative analysis of differential expression.
In a recent study [19], it has been shown that the widely used overlap count (or Venn diagrams) is not an appropriate metric for measuring the reproducibility of differential expression analysis. It is necessary to develop new metrics for rigorously measuring the reproducibility of differential expression analysis. The disadvantage of overlap count metric is that the randomness of differential expression measures (e.g. ttest) has not been well considered. However, our mixturemodel based tests of genomewide concordance and discordance [18] take this randomness into account and the reported pvalues can be used as rigorous metrics for measuring the reproducibility of differential expression analysis.
For the rest of paper, we first introduce our statistical framework. Then, we use simulated data to evaluate its performance. Two experimental data based case studies are considered as the applications. Finally, we discuss the advantages and disadvantages of our method.
Methods
A statistical framework
 1.
In each data set, perform a statistical test of differential expression for each gene to obtain a lists of test scores;
 2.
For each list of test scores, perform a transformation procedure to obtain a list of zscores;
 3.
For two lists of zscores, test the complete discordance between them;
 (a)
if the complete discordance cannot be rejected, then the data integration will be discouraged;
 (b)
if the complete discordance can be rejected, then continue to the next step;
 4.
For two lists of zscores, test the complete concordance between them;
 (a)
if the complete concordance cannot be rejected, then calculate a list of concordant integrative scores based on the complete concordance (CC) model;
 (b)
if the complete concordance can be rejected, then calculate a list of concordant integrative scores based on the partial concordance/discordance (PCD) model;
 5.
Use the list of concordant integrative scores to prioritize genes for the followup study.
Test of differential expression
For simplicity, we consider the Student's twosample ttest for differential expression analysis. Other test statistics, such as Wilcoxon's rank sum test or a generalized t/Fstatistic [20], can certainly be considered. The statistical significance (pvalue) of a test value can be evaluated based on either its theoretical null distribution or a permutation null distribution [21]. In this study, the theoretical pvalue is used for the simulation study since we know the underlying distribution; the permutation pvalue is used for the application since the underlying distribution is unknown (B = 500 is used as the number of permutations).
Transformation of test score
It has been suggested transforming a test value to its associated zscore so that more efficient results can be achieved in a normal mixture model based analysis [22]. When the onesided (uppertailed) pvalue of a test value is available, the associated zscore can be simply calculated by
z = Φ^{1}(1  p),
where Φ^{1}(·) is the inverse of standard normal distribution. Notice that it is necessary to use onesided pvalues since we intend to distinguish upregulated differential expression from downregulated differential expression.
Mixture models
More details for these models have been described in our recent publication [18]. In these models, index 0 is used to represent the null component with fixed parameters: μ_{0} = ν_{0} = 0 and ${\sigma}_{0}^{2}={\tau}_{0}^{2}=1$; indices 1 and 2 are used to represent the downregulated and upregulated components with constrains: μ_{1}, ν_{1} ≤ 0 and μ_{2}, ν_{2} ≥ 0; π_{ ij }is the proportion of genes belonging to the ith component in the first data set and jth component in the second data set (Σ_{ ij }π_{ ij }= 1). π_{i.}is the marginal proportion of genes belonging to the ith component in the first data set; and π._{ j }is the marginal proportion of genes belonging to the jth component in the second data set. The model parameters can be estimated through an EM algorithm [23]. The detail has also been described in our recent publication [18].
Tests of concordance and discordance
The statistical significance of a test value can be evaluated by the parametric bootstrap procedure [24], which has also been described in our recent publication [18].
Data integration
False positive control
where S(·) is calculated based on an appropriate mixture model (CC or PCD). With this estimate, one may realize that the false discovery rate [25] (FDR) based on the qvalue concept [26] can be simply estimated as $\widehat{FP}$/K.
For an individual data set, we simply use the Rpackage qvalue [26] to obtain the estimated FDR. Then, the number of false positives for the top K genes can be simple estimated as K × FDR(K), which is theoretically consistent with the above $\widehat{FP}$.
Results and discussion
Illustrative examples
A simulation study
There are many parameters to be considered when we simulate microarray gene expression data:

Gene size m;

The proportion of nondifferentially expressed genes π_{0};

Sample sizes of two groups n_{1}, n_{2};

Distributions of expression measurements of differentially and nondifferentially expressed genes;

Covariance structure among genes.
In our simulation studies, we consider the widely used block structure: genes are partitioned into many blocks; genes within the same block are positively dependent; and different blocks are independent. To save the computing time, we reasonably set gene size m = 6000, π_{0} = 80% and n_{1} = n_{2} = 15 for each of two data sets. The block size (number of genes in each block) is set b = 25. Within each block, the expression measurements are simulated from a multivariate normal distribution. For blocks of nondifferentially expressed genes, we simulate expression measurements from N(${\widehat{\mu}}_{0}$, Σ_{0}), where ${\widehat{\mu}}_{0}$ is a b × 1 vector of 0's and Σ_{0} is a b × b matrix with diagonal entries as 1 and nondiagonal entries as a fixed value (simulated from a Uniform distribution U[0.5, 0.9]). For blocks of differentially expressed genes, we simulate expression measurements from N(${\widehat{\mu}}_{1}$, Σ_{1}) and N(${\widehat{\mu}}_{2}$, Σ_{2}) for the first and the second sample groups, respectively. ${\widehat{\mu}}_{2}$ is simply a b × 1 vector of 0's. To simulate ${\widehat{\mu}}_{1}$, we first simulate b × 1 vector of random numbers from a Beta distribution Beta(1.5, 1.5), multiply this vector by a factor r = 1.5, and then multiply randomly simulated signs (50% positive and 50% negative) so that both up and down regulated differential expression can be generated. Σ_{1} and Σ_{2} are similarly generated as Σ_{0}. Two data sets are first simulated based on the same configuration. To simulate genes with discordant behavior, we randomly reallocate ξ = 0%, 15%, 30%, 45%, 60% and 75% genes in the second data set so that these genes are no longer matched with those in the first data set. Notice that the simulation configuration is not completely consistent with what have been assumed for our method. This is intentionally designed to understand the robustness of our method. The complete concordance or the complete discordant will be rejected at the level pvalue < 0.025 since there is a issue of multiple hypothesis testing. To save computing time, we only perform the parametric bootstrap for 100 times to evaluate the pvalue of a test.
For each round of simulation, since we know the truth (simulation configuration), we can use the curve of number of concordantly differentially expressed genes (True Positives) against number of claimed ones (Claimed Positives) to evaluate the performance of our method. After many (50) rounds of simulations (it takes a long time for each round due to the parametric bootstrap procedure with the EM algorithm based estimation), we can take the average to obtain a smooth mean curve. Since the existing data integration methods [9–14] do not consider the issue of genomewide concordance/discordance, they are not included for comparison in this study. However, we have compared our method with the simple pooling approach: observations of the same gene from two data sets are simply combined for each sample group, and the ttest is applied to each gene in the pooled data set. This approach is feasible since the measurements in two simulated data sets are comparable. (Then, this approach is a desired efficient approach when two data sets are genomewide concordant. It is interesting to understand its loss of power when two sets are not genomewide concordant.)
Applications
A case study of partial concordance/discordance
Since NOD mouse spontaneously develops type 1 diabetes, it has been widely used for studying the disease. Based on a time course microarray study using samples collected from 3 weeks to 10 weeks, week 5 is a key checkpoint for the development of type 1 diabetes [7]. To distinguish the genes related to diabetes development from the genes related to aging, two other data sets have also been collected for two congenic strains: NOD.Idd3/Idd10 and NOD.B10SnH2, which do not spontaneously develop diabetes. Samples have been collected at different time points from 3 weeks to 10 weeks [7]. Although these two strains do not spontaneously develop type 1 diabetes, it is still interesting to understand their differential expression before 5 weeks vs. after 5 weeks. Furthermore, understanding genes with concordant/discordant behavior for these two strains is important. Therefore, the data set collected for each congenic strain is partitioned into two sample groups: for strain NOD.Idd3/Idd10, there are 11 and 13 subjects collected before 5 weeks and after 5 weeks, respectively; for strain NOD.B10SnH2, there are 22 and 10 subjects collected before 5 weeks and after 5 weeks, respectively. Measurements for 11,424 genes have been collected based on a cDNA microarray platform.
Application results. The parametric bootstrap based null quantiles are used to evaluate the significance (pvalues) of the tests of complete discordance and complete concordance between two NOD mouse data sets.
Quantile under Null  

Test  Observed  90%  95%  99%  99.5%  99.9% 
Complete Discordance (T_{CD})  2477.1  2.6  3.2  4.6  5.5  7.3 
Complete Concordance (T_{CC})  269.3  378.9  371.6  350.9  345.8  340.5 
Application results. The parameters in the PCD model for two NOD mouse data sets are estimated through an EM algorithm.
Data Set Two (NOD.B10SnH2)  

Down  Null  Up  Mean  Variance  
Data Set One  Down  0.150  0.078  0.001  2.424  1.279 
Null  0.076  0.488  0.018  0  1  
(NOD.Idd3/Idd10)  Up  0.002  0.079  0.109  2.685  1.493 
Mean  2.032  0  3.209  
Variance  1.333  1  0.777 
Figure 4c shows the estimated false discovery rates. Compared to the analysis based on individual data sets, the PCD model based false positive control is not necessarily better since we intend to detect these concordantly differentially expressed genes in the integrative data analysis. Furthermore, it is important that an appropriate model must be used for the data integration. Figure 4c also shows that the CC model based analysis results can be seriously misleading. Therefore, the tests of complete concordance and complete discordance are crucial before the data integration can be considered. Figure 4d–f compare the gene ranks based on the PCD model based integration and two individual data sets. The gene ranks based on two individual data sets are quite discordant: the Spearman's rank correlation is just 0.36. The integration based gene ranks and these based on two individual data sets are quite concordant: the Spearman's rank correlations are 0.78 and 0.72, respectively.
A case study of complete concordance
In practice, we may collect gene expression data for the same study from different laboratories based on different microarray platforms. These data usually cannot be directly combined for an analysis with a larger sample size. Our method can also be used to solve this problem. Although this situation has been discussed in our simulation study, it is still necessary to illustrate it with experimental data. Here, we generate a case of complete concordance based on an experimental data set. The data set was collected for a prostate cancer study [5]. Genomewide expression profiles for 6034 genes (after data preprocessing) have been measured for 50 normal and 52 cancerous subjects. We randomly split this data set into two subsets with equal sample sizes (25 normal and 26 cancerous subjects).
Application results. The parametric bootstrap based null quantiles are used to evaluate the significance (pvalues) of the tests of complete discordance and complete concordance between two NOD mouse data sets.
Quantile under Null  

Test  Observed  90%  95%  99%  99.5%  99.9% 
Complete Discordance (T_{CD})  1515.8  1.1  1.4  2.4  2.6  3.2 
Complete Concordance (T_{CC})  635.4  264.4  257.2  245.7  241.8  234.9 
Application results. The parameters in the CC model for two prostate cancer subsets are estimated through an EM algorithm.
Subset 2  

Down  Null  Up  Mean  Variance  
Subset 1  Down  0.157      2.290  0.987 
Null    0.273    0  1  
Up      0.570  1.685  0.629  
Mean  2.123  0  1.678  
Variance  0.924  1  0.547 
Figure 5c shows the estimated false discovery rates. Compared to the analysis based on individual data sets, the CC model based false positive control shows a clear improvement. However, the false positive control based on the original data (subsets 1 and 2 pooled together) is the best. This is consistent with our simulation results. Figure 5d–f compare the gene ranks based on the CC model based integration and two individual subsets. The gene ranks based on two individual subsets are quite discordant: the Spearman's rank correlation is just 0.50. The integration based gene ranks and these based on two individual data sets are quite concordant: the Spearman's rank correlations are both 0.81. Furthermore, the integration based gene ranks are highly concordant with these based on the original data (result not shown): the Spearman's rank correlation is 0.96.
Conclusion
In this study, we have proposed a statistical framework for integrating two microarray gene expression data sets in differential expression analysis. Our simulation and application results confirm that it is necessary to evaluate the genomewide concordance before the consideration of data integration. Otherwise, misleading results can be generated from the integrative analysis. Our current study focuses on the integration of two data sets with twosample groups. In our future study, we will generalize our method for multiple data sets. However, it is less straightforward to generalize our method for multisample groups since it is difficult to define the concordance/discordance for multiple groups.
Because of the randomness of data, we can always observe some intersection of genes selected from two data sets if the selection criterion is not stringent. This is the case even when two data sets are completed unrelated. (If the selection criterion is stringent, then we may always observe a null intersection even when two data sets are actually related.) Therefore, the genomewide concordance/discordance is a critical issue in the integrative analysis microarray data. The traditional hypergeometric analysis relies on the criterion of gene selection, which can be quite arbitrary in practice. For example, the results based on the threshold of 5%, 10% or 20% false discovery rates can be considerably different. It is not a rigorous approach to address the genomewide concordance/discordance. In a recent study [19], it has also been shown that the widely used overlap count (or Venn diagrams) is not an appropriate metric for measuring the reproducibility of differential expression analysis. Furthermore, it is not clear how to rank genes efficiently in the intersection of genes selected from two data sets.
To our knowledge, there is no other existing methods for evaluating genomewide concordance/discordance before the consideration of data integration. Our mixture model based approach is simple and intuitive. There are usually 3 major gene groups in a data set: upregulated, downregulated and null genes, which correspond to the three components in our model. The model inference is welldeveloped in the field of statistics. Furthermore, our model allows us to provide rigorous ranks for genes analyzed in two data sets. In our simulation study, our method can still provide a comparable performance in the situation of complete genomewide concordance when the ideal pooling approach is feasible. If two data sets are not completely concordant, then our method will provide a better performance.
Our method has several advantages. It allows us to test genomewide concordance/discordance, which is a critical issue before the data integration can be considered. It is a likelihoodbased approach, which is efficient when the underlying model is not seriously misspecified. We have also showed the robustness of our method through a simulation study when the underlying models are somewhat inconsistent. Furthermore, the data integration is achieved through a rigorously defined probability with close formulas.
Our method also has the following disadvantages. It is difficult to validate the assumed mixture model. However, without this assumption, we currently have no effect approach for evaluating genomewide concordance/discordance. Furthermore, the calculation of likelihood assumes that the test scores from different genes are independent. However, it is wellknown that the covariance structure of a microarray gene expression data set can be complicated. In our future study, we will explore more efficient approaches to overcome these disadvantages.
Declarations
Acknowledgements
The Rcode is freely available at the author's website [27]. This work was partially supported by NIH grants DK75004 (Y. Lai), HD37800 (JX. She) and HD50196 (JX. She).
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/10?issue=S1
Authors’ Affiliations
References
 Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995, 270: 467470. 10.1126/science.270.5235.467.View ArticlePubMedGoogle Scholar
 Lockhart D, Dong H, Byrne M, Follettie M, Gallo M, Chee M, Mittmann M, Wang C, Kobayashi M, Horton H, Brown E: Expression monitoring by hybridization to highdensity oligonuleotide arrays. Nature Biotechnology. 1996, 14: 16751680. 10.1038/nbt12961675.View ArticlePubMedGoogle Scholar
 Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycleregulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell. 1998, 9: 32733297.PubMed CentralView ArticlePubMedGoogle Scholar
 Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286: 531537. 10.1126/science.286.5439.531.View ArticlePubMedGoogle Scholar
 Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D'Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR: Gene expression correlates of clinical prostate cancer behavior. Cancer Cell. 2002, 1: 203209. 10.1016/S15356108(02)000302.View ArticlePubMedGoogle Scholar
 Wilson KHS, Eckenrode SE, Li QZ, Ruan QG, Yang P, Shi JD, DavoodiSemiromi A, Mclndoe RA, Croker BP, She JX: Microarray analysis of gene expression in the kidneys of new and postonset diabetic NOD mice. Diabetes. 2003, 52: 21512159. 10.2337/diabetes.52.8.2151.View ArticlePubMedGoogle Scholar
 Eckenrode SE, Ruan Q, Yang P, Zheng W, Mclndoe RA, She JX: Gene expression profiles define a key checkpoint for type 1 diabetes in NOD mice. Diabetes. 2004, 53: 366375. 10.2337/diabetes.53.2.366.View ArticlePubMedGoogle Scholar
 Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop L: PGC1αresponse genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics. 2003, 34: 267273. 10.1038/ng1180.View ArticlePubMedGoogle Scholar
 Choi JK, Yu U, Kim S, Yoo OJ: Combining multiple microarray studies and modeling interstudy variation. Bioinformatics. 2003, 19 (Supplement 1): i8490. 10.1093/bioinformatics/btg1010.View ArticlePubMedGoogle Scholar
 Xu L, Tan AC, Naiman DQ, Geman D, Winslow RL: Robust prostate cancer marker genes emerge from direct integration of interstudy microarray data. Bioinformatics. 2005, 21: 39053911. 10.1093/bioinformatics/bti647.View ArticlePubMedGoogle Scholar
 Conlon EM, Song JJ, Liu JS: Bayesian models for pooling microarray studies with multiple sources of replications. BMC Bioinformatics. 2006, 7: 24710.1186/147121057247.PubMed CentralView ArticlePubMedGoogle Scholar
 Hong F, R RB: A comparison of metaanalysis methods for detecting differentially expressed genes in microarray experiments. Bioinformatics. 2008, 24: 374382. 10.1093/bioinformatics/btm620.View ArticlePubMedGoogle Scholar
 Xu L, Tan AC, Winslow RL, Geman D: Merging microarray data from separate breast cancer studies provides a robust prognostic test. BMC Bioinformatics. 2008, 9: 12510.1186/147121059125.PubMed CentralView ArticlePubMedGoogle Scholar
 Borozan I, Chen L, Paeper B, Heathcote JE, Edwards AM, Katze M, Zhang Z, McGilvray ID: MAID: An effect size based model for microarray data integration across laboratories and platforms. BMC Bioinformatics. 2008, 9: 30510.1186/147121059305.PubMed CentralView ArticlePubMedGoogle Scholar
 Cahan P, Ahmad AM, Burke H, Fu S, Lai Y, Florea L, Dharker N, Kobrinski T, Kale P, McCaffrey TA: List of listsannotated (LOLA): a database for annotation and comparison of published microarray gene lists. Gene. 2005, 360: 7882. 10.1016/j.gene.2005.07.008.View ArticlePubMedGoogle Scholar
 EinDor L, Zuk O, Domany E: Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proceedings of the National Academy of Sciences USA. 2006, 103: 59235928. 10.1073/pnas.0601231103.View ArticleGoogle Scholar
 Miron M, Woody OZ, Marcil A, Murie C, Sladek R, Nadon R: A methodology for global validation of microarray experiments. BMC Bioinformatics. 2006, 7: 33310.1186/147121057333.PubMed CentralView ArticlePubMedGoogle Scholar
 Lai Y, Adam BL, Podolsky R, She JX: A mixture model approach to the tests of concordance and discordance between two large scale experiments with twosample groups. Bioinformatics. 2007, 23: 12431250. 10.1093/bioinformatics/btm103.View ArticlePubMedGoogle Scholar
 Zhang M, Yao C, Guo Z, Zou J, Zhang L, Xiao H, Wang D, Yang D, Gong X, Zhu J, Li Y, Li X: Apparently low reproducibility of true differential expression discoveries in microarray studies. Bioinformatics. 2008,Google Scholar
 Cui X, Hwang JTG, Qiu J, Blades NJ, Churchill GA: Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics. 2005, 6: 5975. 10.1093/biostatistics/kxh018.View ArticlePubMedGoogle Scholar
 Dudoit S, Shaffer JP, Boldrick JC: Multiple hypothesis testing in microarray experiments. Statistical Science. 2003, 18: 71103. 10.1214/ss/1056397487.View ArticleGoogle Scholar
 McLachlan GJ, Bean RW, Jones LB: A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics. 2006, 22: 16081615. 10.1093/bioinformatics/btl148.View ArticlePubMedGoogle Scholar
 McLachlan GJ, Krishnan T: The EM algorithm and extensions. 1997, John Wiley & Sons, IncGoogle Scholar
 McLachlan GJ: On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. Applied Statistics. 1987, 36: 318324. 10.2307/2347790.View ArticleGoogle Scholar
 Benjamini Y, Hochberg Y: Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B. 1995, 57: 289300.Google Scholar
 Storey JD, Tibshirani R: Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, USA. 2003, 100: 94409445. 10.1073/pnas.1530509100.View ArticleGoogle Scholar
 Web link for Rcode. [http://home.gwu.edu/~ylai/research/Concordance]
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.