 Methodology article
 Open Access
Evaluation of fecal mRNA reproducibility via a marginal transformed mixture modeling approach
 Nysia I George^{1},
 Joanne R Lupton^{2},
 Nancy D Turner^{2},
 Robert S Chapkin^{2},
 Laurie A Davidson^{2} and
 Naisyin Wang^{3}
https://doi.org/10.1186/1471-2105-11-13
© George et al; licensee BioMed Central Ltd. 2010
 Received: 1 April 2009
 Accepted: 7 January 2010
 Published: 7 January 2010
Abstract
Background
Developing and evaluating new technology that enables researchers to recover gene-expression levels of colonic cells from fecal samples could be key to a noninvasive screening tool for early detection of colon cancer. The current study, to the best of our knowledge, is the first to investigate and report the reproducibility of fecal microarray data. Using the intraclass correlation coefficient (ICC) as a measure of reproducibility and the preliminary analysis of fecal and mucosal data, we assessed the reliability of mixture density estimation and the reproducibility of fecal microarray data. Using Monte Carlo-based methods, we explored whether ICC values should be modeled as a beta-mixture or transformed first and fitted with a normal-mixture. We used outcomes from bootstrapped goodness-of-fit tests to determine which approach is less sensitive toward potential violation of distributional assumptions.
Results
The graphical examination of both the distributions of ICC and probit-transformed ICC (PT-ICC) clearly shows that there are two components in the distributions. For ICC measurements, which are between 0 and 1, the practice in the literature has been to assume that the data points are from a beta-mixture distribution. Nevertheless, in our study we show that the use of a normal-mixture modeling approach on PT-ICC could provide superior performance.
Conclusions
When modeling ICC values of gene expression levels, using a mixture of normals in the probit-transformed (PT) scale is less sensitive toward model misspecification than using a mixture of betas. We show that a biased conclusion could be made if we follow the traditional approach and model the two sets of ICC values using the mixture of betas directly. The problematic estimation arises from the sensitivity of beta-mixtures toward model misspecification, particularly when there are observations in the neighborhood of the boundary points, 0 or 1. Since beta-mixture modeling is commonly used in approximating the distribution of measurements between 0 and 1, our findings have important implications beyond the findings of the current study. By using the normal-mixture approach on PT-ICC, we observed the quality of reproducible genes in fecal array data to be comparable to those in mucosal arrays.
Keywords
 Intraclass Correlation Coefficient
 Beta Distribution
 Mixture Modeling Approach
 High Intraclass Correlation Coefficient
 Estimate Density Function
Background
Microarray techniques have changed the practice of detecting messenger RNA (mRNA) expression of a single gene to the current stage of simultaneously measuring the expression of thousands of genes. Daily improvement in this technology also stimulates techniques that lead to new bioassays. Among them, and of particular interest, is a recent development that enables the collection of genomic information from exfoliated colonocytes in fecal matter. It is known that early detection of cancerous colon cells results in high cure and survival rates among colon cancer patients. However, people tend to shy away from invasive procedures such as the colonoscopy. Consequently, it is of great interest to develop noninvasive early detection instruments. Although evidence exists in the fecal platform that partially degraded mRNA in fecal samples can produce meaningful measurements [1], and Davidson et al. [2] and Kanaoka et al. [3] suggest that it is possible to isolate intact fecal eukaryotic mRNA, it is unknown whether one can expect the same quality from the large amount of fecal microarray data. The current study, to the best of our knowledge, is the first one that investigates and reports the reproducibility of fecal microarray data. In a proof-of-principle study conducted by human nutrition scientists at Texas A&M University, one main task is to find out whether one can expect the same level of reproducibility in the fecal platform as that observed in the mucosal platform where biological samples were taken from colon cells. Because of biological variation, two gene expression values of the same gene taken from the same subject are most likely not the same. In order to determine if one can successfully obtain the same findings when an experiment is repeated, it is important to investigate whether the gene expression levels of a gene from the same subject behave more similarly to each other than to those of the same gene from different subjects.
The signal is strongest and the reproducibility is highest when the outcomes can be perfectly repeated in a different set of measurements taken from the same subjects. It is expected that, due to mRNA degradation, a larger proportion of genes in the fecal platform would possess no or lower reproducibility than those in the mucosal platform. However, it is of interest to understand the quality of those genes which are not degraded in the fecal platform.
Generally, replicates are samples collected from the same subject that are processed separately and independently after sample collection. Our replicates differ because the "same" biological samples are separately processed only right before the hybridization. The former "replicates" are often collected to evaluate the quality of microarray techniques, while we are truly interested in biological reproducibility at the subject level. This subtle difference is particularly important; some genes could be preserved in one sample but degraded in another even when both samples are from the same subject. It is the genes with a low probability of being degraded that we are interested in. While we focus only on subject-to-subject variation, we acknowledge that there are other types of replication in gene expression data [4].
In order to assess the agreement between measurements from microarray data collected from the same subject, we use the intraclass correlation coefficient (ICC) as a reliability index. The use of ICC in genomic studies was promoted by Carrasco and Jover [5].
Under each platform, we compute a single ICC value for each gene. One key advantage of ICC as a statistical tool for evaluating reproducibility for different platforms/instruments is that it does not require two platforms/instruments to be evaluated under the same treatment design. In most biological experiments, researchers tend to conduct the second experiment with modifications and improvements rather than simply repeat what has been done before. Consequently, a statistical tool for evaluating reproducibility has to have the flexibility to accommodate this common practice. In order to fulfill this requirement, the ICC values were computed after removing the treatment effects. The single index recorded per gene uses variance components analysis to compare the similarity of measurements for samples taken from the same subject/rat with the similarity of measurements for samples taken from different subjects/rats. We report the methodology for calculating ICC in the Methods subsection.
The larger the value of ICC, the more differentiation among measurements collected from different biological samples relative to that among readings collected from the same biological material. An ICC value near 1 signifies a strong indication of reproducibility and agreement between experiments. If the ICC is near 0, then within-subject variance is relatively large compared to between-subject variance and it is likely that one cannot obtain the same expression level in a repeated experiment.
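To make the interpretation above concrete, the following minimal Python sketch computes an ICC from between-subject and within-subject variance components in a balanced one-way layout. It is an illustration only and not necessarily the estimator used in this study: the function name `icc_one_way`, the balanced design, the truncation of a negative between-subject component at zero, and the toy data are all assumptions, and the removal of treatment effects described earlier is omitted.

```python
from statistics import mean

def icc_one_way(groups):
    """ICC from a balanced one-way layout: one list of replicate values per subject."""
    n = len(groups)        # number of subjects
    k = len(groups[0])     # replicates per subject (assumed equal for all subjects)
    grand = mean(v for g in groups for v in g)
    # Between-subject and within-subject mean squares.
    msb = k * sum((mean(g) - grand) ** 2 for g in groups) / (n - 1)
    msw = sum((v - mean(g)) ** 2 for g in groups for v in g) / (n * (k - 1))
    # Variance components; a negative between-subject estimate is truncated at zero.
    var_between = max(0.0, (msb - msw) / k)
    total = var_between + msw
    return var_between / total if total > 0 else 0.0

# Toy data: replicates agree within subject (ICC near 1) ...
reproducible = [[1.0, 1.1], [3.0, 2.9], [5.0, 5.2]]
# ... versus replicates that differ more within than between subjects (ICC of 0).
noisy = [[1.0, 3.0], [3.0, 1.0], [1.0, 3.0]]
icc_high = icc_one_way(reproducible)
icc_low = icc_one_way(noisy)
```

In the first toy data set nearly all variation is between subjects, so the index is close to 1; in the second, all variation is within subjects and the index is 0.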
In both the mucosal and the fecal genes, we observe at least a small proportion of genes that always have low reproducibility; their existence results in a mixture model for the distribution of ICC values. It is common practice to use finite mixture modeling in bioinformatics research. The reasons tend to be twofold: to accommodate measurement heterogeneity and to identify potentially meaningful subgroups. The most popular approach is the use of finite normal mixtures [6–9]. Allison et al. and Ji et al. use beta-mixture modeling to describe distributional properties of different genes' correlation coefficients [10, 11]. Like measurements of ICC, the values of correlation coefficients are between 0 and 1. For the same type of data, McLachlan et al. prefer the use of normal-mixture distributions, which eliminates the (0,1)-range constraint [8].
In a study comparing the fecal and mucosal bioassay platforms, we obtained different proportions for the mixture components when we modeled the probit-transformed ICC (PT-ICC) values with a two-component normal-mixture distribution and when we modeled the ICC values with a two-component beta-mixture distribution. It was our conjecture that, considering the boundary problem of beta distribution modeling, the normal-mixture modeling might be less sensitive toward model misspecification. We observed the lower component of the beta mixture to be strictly decreasing, with the density f(y|α,β) approaching infinity as y approaches 0. This phenomenon likely caused the maximum likelihood estimate (MLE) of the β-parameters to be unstable. We conducted a sequence of numerical studies to compare the two approaches.
Our ultimate goal is to select the better of the two systems to ascertain whether the "reproducible" component in the fecal array samples shares similar properties with that of the mucosal array samples.
Results
Data sets
Gene expression levels from the colon mucosal and fecal data samples were collected using CodeLink microarrays (30 oligonucleotide target probe, single color labeling system). The main dataset under study here consisted of 2171 genes for the fecal data and 2241 genes for the mucosal data. Because the bioassays used to extract fecal mRNA were developed later, the mucosal data we used were collected much earlier in a different experiment. In fact, we did not have access to the original mucosal dataset. We were able to use the available summary statistics to produce ICC measurements. All measurements (fecal and mucosal) were collected from Sprague Dawley rats.
Fecal Data
The fecal array data were collected from rat fecal samples in a study designed to explore the effect that diet has on genes being differentially expressed after exposure to carcinogen/radiation. A normalization procedure was developed [12]. Rats in the study were exposed to the carcinogen azoxymethane (AOM) and randomly assigned to one of four different treatments resulting from a 2 × 2 factorial design. The two experimental factors were diet, either fish oil/pectin (D1) or corn oil/cellulose (D2), and radiation, either with radiation exposure (IRT) or without radiation exposure (RCT). Fecal samples were collected 14 weeks after the last exposure to the carcinogen AOM. There were 7, 6, 8, and 7 bioarrays collected under IRT-D1, IRT-D2, RCT-D1, and RCT-D2, respectively. Genes that were not disqualified and had at least 3 usable replicates were kept.
Mucosal Data
Rats used in the study to obtain mucosal array data were randomly assigned in a 3 × 2 × 2 factorial experiment to a treatment with diet, exposure, and time points as factors [13]. Corn oil/n-6 polyunsaturated fatty acid (PUFA), fish oil/n-3 PUFA, or olive oil/n-9 monounsaturated fatty acid (MUFA) was used as the dietary fat source; carcinogen AOM or saline was used as the exposure source; time points were either 12 hours or 10 weeks after the first injection. The units were terminated at the appropriate time point in order to remove the mucosal layer from each colon so that RNA could be extracted from the mucosal samples. The numbers of arrays for corn, fish, and olive oil diets under AOM or saline treatments were (7, 7, 6) and (7, 6, 7), respectively, for the 12-hour study and were (12, 10, 8) and (7, 9, 7), respectively, for the 10-week study.
Matched Subset
To address the issue of reproducibility for a finite list of common genes between the platforms, we conducted an additional study referred to as the "matched subset" throughout. We were able to retrieve the NCBI gene information from the mucosal experiments and used it to create a matched subset in which the two subsets (fecal and mucosal) were collected from the same genes. Each subset contains 1029 measurements.
Preliminary Application to the Main Dataset
Monte Carlo Assessments
To investigate the sensitivity of each of the two mixture modeling approaches to distributional misspecification, we conduct Monte Carlo simulation studies to mimic what we observed in the fecal and mucosal microarray data sets. Simulation for the fecal data is described as follows:
Simulation scenario #1: Data Generated from Beta-mixtures, Fit with Normal-mixtures
(1) Generate Y_{1}, ..., Y_{n} from f_{Y}(y) = 0.7 Beta(2.6, 1.7) + 0.3 Beta(0.2, 0.8).
(2) Transform Y_{1}, ..., Y_{n} using the probit transformation and fit the PT-ICC measurements with a two-component normal-mixture model.
Simulation scenario #2: Data Generated from Normal-mixtures, Fit with Beta-mixtures
(1) Generate X_{1}, ..., X_{n} from f_{X}(x) = 0.7 N(0.04, 0.8) + 0.3 N(−3.5, 0.07).
(2) Transform X_{1}, ..., X_{n} using the inverse probit transformation and fit the ICC data with a two-component beta-mixture model.
We repeated each simulation s = 250 times for sample size n = 1600 and used the EM algorithm to obtain the estimates of the corresponding parameters. The steps above were repeated for the mucosal dataset, where the beta random variables were generated from f_{Y}(y) = 0.8 Beta(2.3, 2.3) + 0.2 Beta(0.3, 1.3) and the normal random variables were generated from f_{X}(x) = 0.8 N(0.3, 0.6) + 0.2 N(−3.3, 0.1).
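The data-generating half of scenario #1 for the fecal setting can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch rather than the code used in the study: the helper names, the seed, and the clipping constant `eps` (which keeps simulated values strictly inside (0, 1) so that the probit transform stays finite) are assumptions.

```python
import random
from statistics import NormalDist

def sample_beta_mixture(n, weight, upper, lower, rng):
    """Draw n values from weight*Beta(*upper) + (1 - weight)*Beta(*lower)."""
    return [rng.betavariate(*(upper if rng.random() < weight else lower))
            for _ in range(n)]

def probit(y, eps=1e-12):
    """Probit transform; values are clipped to [eps, 1 - eps] (an assumption)."""
    return NormalDist().inv_cdf(min(max(y, eps), 1.0 - eps))

rng = random.Random(2010)  # arbitrary seed, for repeatability only
# Step (1): n = 1600 draws from 0.7 Beta(2.6, 1.7) + 0.3 Beta(0.2, 0.8).
icc = sample_beta_mixture(1600, 0.7, (2.6, 1.7), (0.2, 0.8), rng)
# Step (2), first half: move the simulated ICC values to the probit scale.
pt_icc = [probit(y) for y in icc]
```

The Beta(0.2, 0.8) component places substantial mass very close to 0, which is the boundary behavior the simulation is designed to probe; the fitting half of step (2) is not shown here.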
Summary Statistics of Simulation Scenario #1: Data Generated from Beta-mixtures, Fit with Normal-mixtures
Dataset  Statistic  Values for the five mixture parameters
Fecal  Truth  0.700  0.328  0.446  1.771  3.330
Fecal  Mean  0.725  0.302  0.440  1.951  3.321
Fecal  Bias  0.025  0.026  0.006  0.180  0.009
Fecal  Std Dev  0.018  0.023  0.028  0.152  0.283
Fecal  RMSE  0.031  0.035  0.029  0.235  0.283
Mucosal  Truth  0.800  0.033  0.391  2.090  2.722
Mucosal  Mean  0.816  0.049  0.398  2.254  2.823
Mucosal  Bias  0.016  0.016  0.007  0.164  0.101
Mucosal  Std Dev  0.015  0.022  0.022  0.157  0.272
Mucosal  RMSE  0.022  0.027  0.023  0.227  0.290
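The row labels of the table (mean, bias, standard deviation, and root mean squared error over the simulation replicates) can be computed as in the following Python sketch. The function name `summarize` and the dictionary layout are illustrative assumptions; the last lines check one reported entry using the approximate identity RMSE² ≈ bias² + SD².

```python
from math import sqrt
from statistics import mean, stdev

def summarize(estimates, truth):
    """Mean, bias, standard deviation and RMSE of replicate estimates of one parameter."""
    avg = mean(estimates)
    rmse = sqrt(mean((e - truth) ** 2 for e in estimates))
    return {"mean": avg, "bias": avg - truth, "sd": stdev(estimates), "rmse": rmse}

# Consistency check on the first fecal column of the table above:
# a bias of 0.025 and a standard deviation of 0.018 imply an RMSE of about 0.031.
implied_rmse = sqrt(0.025 ** 2 + 0.018 ** 2)
```

The same check applied to the other columns reproduces the last row of each block up to rounding of the inputs.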
Summary Statistics of Simulation Scenario #2: Data Generated from Normal-mixtures, Fit with Beta-mixtures
Dataset  Statistic  Values for the five mixture parameters
Fecal  Truth  0.700  0.328  0.446  1.771  3.330
Fecal  Mean  0.453  0.282  0.521  1.995  3.409
Fecal  Bias  0.247  0.046  0.075  0.224  0.079
Fecal  Std Dev  0.010  0.036  0.032  0.050  0.138
Fecal  RMSE  0.247  0.059  0.082  0.229  0.159
Mucosal  Truth  0.800  0.033  0.391  2.090  2.722
Mucosal  Mean  0.527  0.149  0.387  1.691  2.546
Mucosal  Bias  0.273  0.116  0.004  0.399  0.176
Mucosal  Std Dev  0.011  0.031  0.023  0.049  0.111
Mucosal  RMSE  0.273  0.120  0.023  0.402  0.208
for fecal (mucosal) data using 5, 8, and 12 bins
Bins  Fit  True: Beta  True: Normal
5  Beta  0.12 (0.08)  0.98 (0.01)
5  Normal  0.13 (0.09)  0.36 (0.01)
8  Beta  0.00 (0.01)  1.00 (1.00)
8  Normal  0.00 (0.01)  0.04 (0.02)
12  Beta  0.02 (0.01)  1.00 (1.00)
12  Normal  0.02 (0.00)  0.03 (0.01)
ICC Comparisons of Fecal and Mucosal Data
Since our findings from the simulation studies suggested that we use a two-component normal-mixture to fit the probit-transformed ICC values, we adopted this strategy and utilized it to compare reproducibility under the fecal and mucosal array platforms. We associate the two components of high (and low) ICC values with reproducible (and irreproducible) genes; see the Discussion subsection for more considerations.
For the fecal and mucosal data, we let π_{LF} and π_{LM} be the proportions of the mixture components consisting of irreproducible genes, and μ_{UF} and μ_{UM} be the means of the mixture components with higher ICC values. We report two main studies that were conducted to explore the extent of the distributional differences between the two platforms. Throughout, we used the bootstrap methods described in the Methods subsection. The first bootstrap analysis is designed to find the 95% confidence interval for the difference in the proportion of irreproducible genes contained in each data set, π_{LF} − π_{LM}. In the second analysis, we identify the 95% confidence interval for the average difference in the mixture components with higher ICC values, μ_{UF} − μ_{UM}. The bootstrapped 95% confidence intervals for the two studies were (0.06, 0.10) for π_{LF} − π_{LM} and (0.27, 0.40) for μ_{UF} − μ_{UM}. As a result, we concluded that while the fecal array had a higher proportion of irreproducible genes, its average ICC value for the reproducible component of genes was a little higher than that obtained from the mucosal platform.
Outcomes for Analysis of Matched Subset
Discussion
There are a few points worth making here. The key problem behind the instability of beta-mixture modeling is that one might attempt to estimate the worst component of the mixture distribution with a small proportion of data observed on the boundary. The specifics of simulation scenarios #1 and #2 were based on our analysis of the original subset of ICC values. We expect the same difficulties would be encountered in the beta-mixture modeling if we have a high density of ICC values close to 1 at the upper component. To investigate this conjecture, we conducted an additional simulation study and report the outcomes in "Additional File 1." We found that the beta-mixture fit the transformed normal data less accurately when the mixture had a high density of values near 1. However, the beta-mixture had no problems fitting transformed normal data resulting from a beta-mixture with no asymptotes at the boundary. There was less distinction between the quality of the fits when the normal-mixture was used to fit PT-ICC data. This again suggests that two-component normal-mixture modeling on PT-ICC is the more reliable approach.
Although the meaning of the estimated parameters is not obvious to interpret, the normal-mixture modeling in Figures 3 and 4 places the cutoff between the two mixture components at around −2. This roughly corresponds to the scenario of an ICC = 5%. By pure randomness, even though the true correlation could be zero, one could observe a nonzero sample correlation of 5% or less. From our numerical analyses on the fecal microarray data, the proportion of ICC values less than 5% ranges from 20% to 28%. The proportions of genes with ICC values less than 5% for the fecal and mucosal samples are 25% and 20%, respectively, in the main study, and 22% and 18%, respectively, in the matched study. These numbers again match better with the outcomes from the normal-mixture modeling.
Finally, we conducted another simulation study using the estimated parameters from the matched subset. The exact setup and outcomes are reported in "Additional File 2." For the mucosal subset of ICC values, we find equivalent results between the beta-mixture approach and the normal-mixture approach. However, results from the simulation study show unsatisfactory performance under the scenario of "Data Generated from Normal-mixtures, Fit with Beta-mixtures". Our mucosal matched subset is most likely beta-mixture distributed.
Conclusion
In this study we have demonstrated that when analyzing ICC values of gene expression levels, it is a better strategy to first probit-transform the ICC values onto the (−∞, ∞) domain and then to model the PT-ICC values with a normal-mixture model. Through this practice, we were able to obtain outcomes that were less sensitive toward distributional assumptions. We avoided the problem of estimating parameters for a beta distribution whose density increases to infinity at the boundary. Our investigations suggested that even though there tended to be a higher proportion of genes that had low reproducibility in the fecal array data than in the mucosal array data, the average ICC value for those genes which possessed relatively high ICC values in the fecal data was even a bit higher than the corresponding average observed in the mucosal platform. We also note that the probit transformation strategy enables us to easily adopt the mixture-of-normals modeling approach, which can be carried out by the MCLUST packages in R or S-PLUS.
Methods
Obtaining ICC Values for Genes on a Microarray Chip
where G is the number of genes.
The Probit Transformation
For X in the range of (0,1), the probit-transformed value, Y, of X is defined as Y = Φ^{−1}(X), where Φ is the standard normal cumulative distribution function, thereby converting (0,1) values to the real line.
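As a small illustration, the transformation and its inverse can be written in Python with the standard library's `statistics.NormalDist`. The function names below are illustrative and not part of the original analysis.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1

def probit(x):
    """Map x in (0, 1) to the real line via the standard normal quantile function."""
    return STD_NORMAL.inv_cdf(x)

def inverse_probit(y):
    """Map a real value back to (0, 1) via the standard normal CDF."""
    return STD_NORMAL.cdf(y)

# An ICC of 5% sits at roughly -1.64 on the probit scale; an ICC of 50% sits at 0.
pt_icc_5_percent = probit(0.05)
```

Note that `inv_cdf` is undefined at exactly 0 or 1, so ICC values on the boundary would need separate handling before transformation.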
Two-component mixture models
The numerical investigations of ICC and PT-ICC values clearly show that the data come from a mixture of two populations. When data are modeled by a mixture of two distributions, we postulate that an observation comes from distribution 1 with probability π and from distribution 2 with probability 1 − π.
Parameter estimation using expectation-maximization (EM) algorithm
The expectation-maximization (EM) algorithm [15] is an iterative approach for estimation in incomplete-data problems. Given starting values of the model parameters, the EM algorithm iteratively updates the estimates until a specified convergence criterion is reached.
Mixture of Betas
Ji et al. [11] advocate modeling correlation coefficients with beta-mixtures and outline the subsequent EM algorithm. Suppose y_{1}, ..., y_{n} are n independent observations from f_{Y}(y|θ_{B}), where f_{Y} is the density of a mixture of two beta distributions and θ_{B} = (π, α_{1}, α_{2}, β_{1}, β_{2}). Let the random vector X = (Z, Y) = {z_{i}, y_{i}}, where z_{i} is a 0-1 indicator variable that tells which distribution, the first or the second, the ith observation comes from.
where the superscript k denotes the estimates at the kth iteration.
and obtain the maximum likelihood estimates of α_{1}, α_{2}, β_{1}, and β_{2} accordingly. The E- and M-steps are iterated until the convergence criterion is met.
The starting values for α_{1}, α_{2}, β_{1}, and β_{2} were set to 0.01, and {z_{i}} was initialized by setting one half of the indicator variables equal to 0 and the other half equal to 1 so that π = 0.50. We utilized the 'optim' function in R to obtain parameter estimates for the two beta density functions. The procedure was repeated until we observed a negligible change in the value of the log-likelihood given in (7).
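The alternation between the two steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure used in the study: the M-step below replaces the numerical maximum likelihood step (done with `optim` in R) by a weighted method-of-moments fit, the indicators are initialized by splitting the data at the median, a fixed number of iterations replaces the log-likelihood stopping rule, and all function names and numerical guards are hypothetical.

```python
import math

def beta_logpdf(y, a, b):
    """Log density of Beta(a, b) at y, for 0 < y < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(y) + (b - 1) * math.log(1 - y)

def e_step(ys, pi, comp1, comp2):
    """E-step: posterior probability that each y_i belongs to component 1."""
    resp = []
    for y in ys:
        l1 = math.log(pi) + beta_logpdf(y, *comp1)
        l2 = math.log(1 - pi) + beta_logpdf(y, *comp2)
        top = max(l1, l2)  # subtract the maximum so exp() cannot overflow
        w1, w2 = math.exp(l1 - top), math.exp(l2 - top)
        # Keep responsibilities strictly inside (0, 1) for numerical safety.
        resp.append(min(max(w1 / (w1 + w2), 1e-9), 1 - 1e-9))
    return resp

def moment_fit(ys, ws):
    """Weighted method-of-moments Beta fit (a stand-in for a numerical MLE)."""
    total = sum(ws)
    m = sum(w * y for w, y in zip(ws, ys)) / total
    v = sum(w * (y - m) ** 2 for w, y in zip(ws, ys)) / total
    common = max(m * (1 - m) / max(v, 1e-12) - 1, 1e-6)
    return m * common, (1 - m) * common

def fit_beta_mixture(data, iters=50, eps=1e-12):
    """Alternate M- and E-steps; returns (pi, (a1, b1), (a2, b2))."""
    ys = [min(max(y, eps), 1 - eps) for y in data]  # keep values off 0 and 1
    cut = sorted(ys)[len(ys) // 2]
    resp = [1.0 if y >= cut else 0.0 for y in ys]   # half of the indicators set to 1
    for _ in range(iters):
        pi = sum(resp) / len(ys)
        comp1 = moment_fit(ys, resp)
        comp2 = moment_fit(ys, [1 - r for r in resp])
        resp = e_step(ys, pi, comp1, comp2)
    return pi, comp1, comp2
```

The log-sum-exp guard in the E-step matters in exactly the situation discussed in this article: when a shape parameter is below 1, the beta log density grows without bound as an observation approaches 0 or 1.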
Mixture of Normals
Let x_{1}, ..., x_{n} be n iid observations from f_{X}(x|θ_{N}), where f_{X} is the density of a mixture of two normal distributions and θ_{N} collects the mixing proportion and the mean and variance parameters of the two components. In order to estimate the parameters for a two-component normal mixture, we use the MCLUST software package for R [16]. MCLUST implements the EM algorithm, equivalent to what was described for the mixture of betas, to carry out the computations of a maximum likelihood approach for normal-mixture models. For model selection, MCLUST determines the number of clusters and the clustering model by maximizing the Bayesian Information Criterion (BIC) [17]. See [16, 18] for more details regarding the MCLUST software package.
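For readers without access to MCLUST, the two-component, unequal-variance case can be sketched directly in Python. This is a bare-bones EM written for illustration and is not the MCLUST implementation: the starting values, the fixed iteration count, the floor `min_sigma` on the standard deviations, and the function names are assumptions, and no BIC-based model selection is performed.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fit_normal_mixture(xs, iters=200, min_sigma=1e-6):
    """EM for pi*N(mu1, s1^2) + (1 - pi)*N(mu2, s2^2); component 1 starts as the upper one."""
    ordered = sorted(xs)
    half = len(ordered) // 2
    # Crude starting values from the lower and upper halves of the data.
    mu2 = sum(ordered[:half]) / half
    mu1 = sum(ordered[half:]) / (len(ordered) - half)
    s1 = s2 = max((ordered[-1] - ordered[0]) / 4.0, min_sigma)
    pi = 0.5
    for _ in range(iters):
        # E-step: posterior probability that each x belongs to component 1.
        resp = []
        for x in xs:
            p1 = pi * normal_pdf(x, mu1, s1)
            p2 = (1.0 - pi) * normal_pdf(x, mu2, s2)
            resp.append(p1 / (p1 + p2) if p1 + p2 > 0.0 else 0.5)
        # M-step: closed-form weighted means, standard deviations and proportion.
        n1 = max(sum(resp), 1e-9)
        n2 = max(len(xs) - sum(resp), 1e-9)
        mu1 = sum(r * x for r, x in zip(resp, xs)) / n1
        mu2 = sum((1.0 - r) * x for r, x in zip(resp, xs)) / n2
        s1 = max(math.sqrt(sum(r * (x - mu1) ** 2 for r, x in zip(resp, xs)) / n1), min_sigma)
        s2 = max(math.sqrt(sum((1.0 - r) * (x - mu2) ** 2 for r, x in zip(resp, xs)) / n2), min_sigma)
        pi = n1 / len(xs)
    return pi, (mu1, s1), (mu2, s2)
```

Unlike the beta case, both M-step updates are available in closed form, so no numerical optimizer is needed.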
Distribution of transformed random variables
Generate from Beta, Fit with Normal
Generate from Normal, Fit with Beta
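The two directions named in the headings above follow from the standard change-of-variables rule. If Y has density f_{Y} on (0,1) and X = Φ^{−1}(Y), then X has density f_{Y}(Φ(x))φ(x), where φ is the standard normal density; conversely, if X has density f_{X} on the real line and Y = Φ(X), then Y has density f_{X}(Φ^{−1}(y))/φ(Φ^{−1}(y)). The Python sketch below implements both as a generic illustration; the function names are hypothetical and the code is not a reproduction of the article's numbered equations.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution

def pt_density(x, f_y):
    """Density at x of X = probit(Y) when Y has density f_y on (0, 1)."""
    return f_y(STD.cdf(x)) * STD.pdf(x)

def icc_density(y, f_x):
    """Density at y of Y = inverse-probit(X) when X has density f_x on the real line."""
    x = STD.inv_cdf(y)
    return f_x(x) / STD.pdf(x)

# Sanity check of the first rule: a Uniform(0, 1) variable (density 1 on the unit
# interval) becomes standard normal after the probit transformation.
uniform_on_probit_scale = pt_density(0.7, lambda y: 1.0)
```

Passing a beta-mixture density as `f_y`, or a normal-mixture density as `f_x`, gives the density of the transformed data in each simulation scenario.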
Chi-square goodness of fit
The chi-square goodness-of-fit statistic sums (O_{i} − E_{i})^{2}/E_{i} over the bins, where O_{i} and E_{i} are the observed and expected frequencies, respectively, for bin i.
To ensure that the expected frequency count is never zero at the tails, we let the first and the last bins be {x : x < X_{(0.025)}} and {x : x ≥ X_{(0.975)}}, respectively, where X_{(0.025)} and X_{(0.975)} are the 2.5th and 97.5th percentiles of the data rounded up and down to the nearest whole numbers. The equal-distance bins correspond to the disjoint intervals in between.
If a dataset is fit with a mixture of normal distributions, then the density function defined in (10) is used to determine the expected frequencies. Likewise, we use (9) to calculate expected frequencies when a dataset is fit with a mixture of betas.
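A minimal Python sketch of the binned statistic is given below. It is an illustration under simplifying assumptions: the fitted model is passed in as a cumulative distribution function `cdf` (a hypothetical interface), the percentile cut points are taken directly from the sorted data without the rounding to whole numbers described above, and every expected frequency is assumed to be positive.

```python
def chi_square_stat(data, cdf, inner_bins):
    """Pearson chi-square statistic with two tail bins and equal-width inner bins."""
    xs = sorted(data)
    n = len(xs)
    lo = xs[int(0.025 * n)]        # approximate 2.5th percentile
    hi = xs[int(0.975 * n) - 1]    # approximate 97.5th percentile
    width = (hi - lo) / inner_bins
    edges = [lo + i * width for i in range(inner_bins)] + [hi]

    # Observed counts: bin 0 is the lower tail, the last bin is the upper tail.
    observed = [0] * (inner_bins + 2)
    for x in xs:
        if x < lo:
            observed[0] += 1
        elif x >= hi:
            observed[-1] += 1
        else:
            observed[1 + min(int((x - lo) / width), inner_bins - 1)] += 1

    # Expected counts from the fitted model's CDF over the same bins.
    expected = [n * cdf(lo)]
    expected += [n * (cdf(edges[i + 1]) - cdf(edges[i])) for i in range(inner_bins)]
    expected.append(n * (1.0 - cdf(hi)))

    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

For a fitted mixture, `cdf` would be the mixture's cumulative distribution function, that is, the weighted sum of the component cumulative distribution functions.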
Bootstrap Analysis
1. Generate bootstrap samples of size n_{1} and n_{2} by sampling with replacement from the original n_{1} observations of fecal and n_{2} observations of mucosal ICC values.
2. Use MCLUST to estimate the parameters of a two-component normal-mixture fitted to each bootstrap sample.
3. Compute the bootstrap estimate of the difference π_{LF} − π_{LM}.
4. Repeat steps 1 through 3 I = 299 times, computing the difference for each pair of bootstrap samples.
Once the I bootstrap differences are obtained, a 100(1 − α)% bootstrap confidence interval is defined by their α/2 and (1 − α/2) percentiles. If we let μ_{UF} and μ_{UM} be the means of the reproducible genes for the fecal and mucosal datasets, then the process for constructing a bootstrap confidence interval for μ_{UF} − μ_{UM} mimics the above procedure, replacing step 3 with "Compute the bootstrap estimate of the difference μ_{UF} − μ_{UM}."
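The resampling loop and the percentile interval can be sketched in Python as follows. The sketch is illustrative only: `statistic` is a placeholder for whatever estimate is computed from each bootstrap sample, and the example statistic `share_below`, which simply counts the share of values under a cutoff, is a hypothetical stand-in for the MCLUST-based estimate of the lower-component proportion. The seed and the index rule used to pick percentiles are also assumptions.

```python
import random

def bootstrap_diff_ci(sample_a, sample_b, statistic, reps=299, alpha=0.05, seed=0):
    """Percentile bootstrap interval for statistic(sample_a) - statistic(sample_b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        # Step 1: resample each data set with replacement, keeping its size.
        boot_a = rng.choices(sample_a, k=len(sample_a))
        boot_b = rng.choices(sample_b, k=len(sample_b))
        # Steps 2-3: estimate on each bootstrap sample and take the difference.
        diffs.append(statistic(boot_a) - statistic(boot_b))
    diffs.sort()
    # Percentile interval: alpha/2 and 1 - alpha/2 order statistics.
    lower = diffs[max(int((alpha / 2) * (reps + 1)) - 1, 0)]
    upper = diffs[min(int((1 - alpha / 2) * (reps + 1)) - 1, reps - 1)]
    return lower, upper

def share_below(values, cutoff=0.05):
    """Hypothetical stand-in statistic: share of ICC values below the cutoff."""
    return sum(v < cutoff for v in values) / len(values)
```

With I = 299 replicates and α = 0.05, the interval endpoints are the 7th and 292nd ordered bootstrap differences under this index rule.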
Declarations
Acknowledgements
This work is supported by grants from US NCI CA74552, CA59034, CA129444 and NSBRI (NASA NCC 958).
Authors’ Affiliations
References
 Schoor O, Weinschenk T, Hennenlotter J, Corvin S, Stenzl A, Rammensee HG, Stevanović S: Moderate degradation does not preclude microarray analysis of small amounts of RNA. BioTechniques 2003, 35: 1192–1201.
 Davidson L, Lupton J, Miskovsky E, Fields A, Chapkin R: Quantification of human intestinal gene expression profiles using exfoliated colonocytes: a pilot study. Biomarkers 2003, 8: 51–61. 10.1080/1354750021000042268
 Kanaoka S, Yoshida KI, Miura N, Sugimura H, Kajimura M: Potential usefulness of detecting cyclooxygenase 2 messenger RNA in feces for colorectal cancer screening. Gastroenterology 2004, 127: 422–427. 10.1053/j.gastro.2004.05.022
 Nguyen D, Arpat A, Wang N, Carroll R: DNA microarray experiments: biological and technological aspects. Biometrics 2002, 58: 701–717. 10.1111/j.0006-341X.2002.00701.x
 Carrasco J, Jover L: Estimating the generalized concordance correlation coefficient through variance components. Biometrics 2003, 59: 849–858. 10.1111/j.0006-341X.2003.00099.x
 Pan W, Lin J, Le CT: A mixture model approach to detecting differentially expressed genes with microarray data. Functional & Integrative Genomics 2003, 3: 117–124.
 Dean N, Raftery AE: Normal uniform mixture differential gene expression detection for cDNA microarrays. BMC Bioinformatics 2005, 6: 173–187. 10.1186/1471-2105-6-173
 McLachlan G, Bean R, Ben-Tovim Jones L: A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics 2006, 22(13). 10.1093/bioinformatics/btl148
 Ghosh D, Chinnaiyan AM: Genomic outlier profile analysis: mixture models, null hypotheses, and nonparametric estimation. Biostatistics 2009, 10: 60–69. 10.1093/biostatistics/kxn015
 Allison D, Gadbury G, Heo M, Fernandez J, Lee C, Prolla T, Weindruch R: A mixture model approach for the analysis of microarray gene expression data. Computational Statistics and Data Analysis 2002, 39: 1–20. 10.1016/S0167-9473(01)00046-9
 Ji Y, Wu C, Liu P, Wang J, Coombes K: Applications of beta-mixture models in bioinformatics. Bioinformatics 2005, 21(9):2118–2122. 10.1093/bioinformatics/bti318
 Liu L, Wang N, Lupton J, Turner N, Chapkin R, Davidson L: A two-stage normalization method for partially degraded mRNA microarray data. Bioinformatics 2005, 21: 4000–4006. 10.1093/bioinformatics/bti661
 Davidson L, Nguyen D, Hokanson R, Callaway E, Isett R, Turner N, Dougherty E, Wang N, Lupton J, Carroll R: Chemopreventive n-3 polyunsaturated fatty acids reprogram genetic signatures during colon cancer initiation and progression in the rat. Cancer Research 2004, 64: 6797–6804. 10.1158/0008-5472.CAN-04-1068
 Finney D: Probit Analysis. 3rd edition. Cambridge, UK: Cambridge University Press; 1971.
 Dempster A, Laird N, Rubin D: Maximum likelihood for incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society Series B 1977, 39: 1–38.
 Fraley C, Raftery A: Software for model-based cluster analysis and discriminant analysis. In Tech Rep 342. University of Washington; 1999.
 Schwarz G: Estimating the dimension of a model. The Annals of Statistics 1978, 6(2):461–464. 10.1214/aos/1176344136
 Fraley C, Raftery A: Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 2002, 97(458):611–631. 10.1198/016214502760047131
 Efron B, Tibshirani R: An Introduction to the Bootstrap. London: Chapman and Hall; 1993.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.