MetSizeR: selecting the optimal sample size for metabolomic studies using an analysis based approach
© Nyamundanda et al.; licensee BioMed Central Ltd. 2013
Received: 7 June 2013
Accepted: 13 November 2013
Published: 21 November 2013
Determining sample sizes for metabolomic experiments is important but due to the complexity of these experiments, there are currently no standard methods for sample size estimation in metabolomics. Since pilot studies are rarely done in metabolomics, currently existing sample size estimation approaches which rely on pilot data can not be applied.
In this article, an analysis based approach called MetSizeR is developed to estimate sample size for metabolomic experiments even when experimental pilot data are not available. The key motivation for MetSizeR is that it considers the type of analysis the researcher intends to use for data analysis when estimating sample size. MetSizeR uses information about the data analysis technique and prior expert knowledge of the metabolomic experiment to simulate pilot data from a statistical model. Permutation based techniques are then applied to the simulated pilot data to estimate the required sample size.
The MetSizeR methodology, and a publicly available software package which implements the approach, are illustrated through real metabolomic applications. Sample size estimates, informed by the intended statistical analysis technique, and the associated uncertainty are provided.
In many metabolomic experiments, one of the most important objectives is to discover the set of metabolites that plays a significant role in distinguishing samples from two different groups or populations and thus, in the identification of novel biomarkers . As in any experiment, designing the experiment is critical if reliable statistically significant metabolites are to be obtained. Since metabolomic experiments are expensive, it is crucial to determine the optimal sample size to attain the desired power to identify discriminating metabolites without wasting resources or unnecessarily sampling many subjects. However, metabolomic data are typically high dimensional and correlated meaning sample size estimation using classical statistical methods is not straight forward.
Currently, in the metabolomics literature, there is no standard method for the determination of sample size when designing a metabolomic experiment. Several methods currently exist in the literature for sample size selection in the high-dimensional data setting [2-5]. However, none of these methods are suitable for metabolomic experiments since many either assume variables have equal variance or are independent. More importantly, these methods rely on the existence of some experimental pilot data on which the actual sample size selection is then based, and are not based on the method to be used to analyze the data. In metabolomic studies, experimental pilot data are rarely available, making such sample size selection approaches redundant.
Intuitively the MetSizeR method can be naturally extended to include other analysis approaches, assuming they are based on a statistical model rather than being non-parametric in nature.
MetSizeR draws on two currently existing methods (see  and ) for sample size calculation in high-dimensional data settings. While the approach in  is based on the existence of an experimental pilot data set, the approach detailed in  simulates pilot data from a statistical model. Further, while independence in the data is assumed in , the approach in  uses permutation methods to account for the correlation in the experimental pilot data. MetSizeR combines these ideas of prior simulation and permutation based techniques to estimate the sample size for metabolomic experiments. The main advantage of the developed approach is its ability to determine sample size without experimental pilot data and without assuming variable independence.
A graphic user interface (GUI) software called MetSizeR was developed to implement this approach to estimating sample sizes in R . Effort was focused on designing the interface of MetSizeR to encourage its wide use in the metabolomics community regardless of previous knowledge of R. The software is available through the R statistical software environment http://www.r-project.org.
Metabolomic data sets are typically acquired using analytical technologies such as nuclear magnetic resonance spectroscopy (NMR)  and mass spectrometry (MS) . The spectrum resulting from NMR spectroscopy is usually divided into spectral bins (representing variables) and the signal intensities within the bins are related to the relative abundances of metabolites. MS is typically used for targeted metabolomics in which a specified list of metabolites is measured . The following section describes how the number of samples required for either an NMR or an MS metabolomic experiment can be determined under the MetSizeR approach.
Sample size estimation
Let be the estimate of the average signal intensity μ jg for metabolite j in samples from the treatment group g which has corresponding sample size n g , where g = 1, 2. Often in metabolomics, the goal of discovering a set of metabolites that discriminates between samples from two treatment groups is achieved by testing the hypothesis H oj : μj1 - μj2 = 0, on each metabolite j, where j = 1,…, p. The aim of discovering discriminating metabolites can be framed as a multiple testing problem as there are p hypotheses to be tested and the probability of falsely declaring a metabolite as significant increases with p. It is therefore important to estimate sample size while controlling an error rate to improve the power of the test for identifying significant metabolites. MetSizeR focuses on controlling the false discovery rate (FDR, ). Here, the FDR is the expected number of metabolites incorrectly deemed to be significantly different between the two treatment groups, as a proportion of the total number of metabolites declared to be significant.
The test statistic and its distribution
where S j is the estimate of the pooled standard error for metabolite j. The corresponding within treatment variability estimate is for g = 1, 2 where x(j g)i denotes the signal intensity for metabolite j in sample i from the treatment group g. A correction factor cf is a small positive value added to the standard error of each metabolite to prevent some metabolites with signal intensity near zero from having large test statistic TS j ; such a metabolite may have TS j ≈ 0/0.
The typical assumption about the null distribution (i.e. the distribution under the null hypotheses) of the test statistic TS j is the t-distribution with n1 + n2 - 2 degrees of freedom. However, when the data violate such an assumption, misleading sample size estimates would result. Hence, as in , MetSizeR estimates the null distribution of TS j using a permutation technique. This is a non-parametric method based on the assumption that under the null hypothesis of no difference, the distribution of the test statistic does not change no matter how the group labels of the pilot data are permuted. The data generated using this approach maintains the between subject variability and the amount of noise in the data. The null distribution of the test statistic TS is estimated by randomly permuting the group labels of pilot data and calculating the test statistic for each metabolite, TS jt , where t = 1,…, T permutations.
Analysis based data simulation
where x is the n × p data matrix, u denotes the latent variables, and θ is a vector of unknown model parameters. Simulating from the marginal model is achieved by first generating values of the parameters and the latent variables from the prior distribution p(u, θ), and then simulating the data from the assumed model p(x|u, θ) given the simulated values of u and θ.
- 1.Generate parameter values from their prior distributions:
- 2.Given the generated model parameters and latent variables the pilot data x are then simulated from the PPCA model:
Estimating sample size based on pilot data simulated in this way ensures the estimated sample size is firmly dependent on the type of model being used to analyse the real experimental metabolomic data. Hence, MetSizeR represents an analysis based approach to sample size estimation for metabolomic studies. The specific steps involved in the MetsizeR algorithm are detailed in the next section.
The MetSizeR algorithm
The MetSizeR procedure for sample size estimation starts with a number ntry of different sample sizes and a user-specified FDR (denoted by target.fdr). It then searches for the optimal sample size by estimating the FDR for each of the ntry sample sizes. In order to estimate FDR for each sample size, the null distribution of the test statistics of all metabolites is estimated and then a shift constant is added to the test statistics of some p o metabolites to allow them to be truly significant. The null distribution is estimated by calculating the test statistics of the permuted pilot data. After obtaining the critical values of the null distribution, the FDR is estimated. The optimal sample size is then set to be the sample size with FDR equal to target.fdr.
Specify the input parameters which include the desired level of FDR (target.fdr), the expected proportion m of significant metabolites and the model to be used when analyzing the observed metabolomic data.
Simulate pilot data of sample size n k from the assumed analysis model, where k = 1,…, ntry. Pilot data simulation from the PPCA model is detailed in the previous Section; the Additional file 1 details pilot data simulation from the PPCCA and DPPCA models.
Estimate the null distribution for all metabolites by randomly permuting the group labels of the simulated pilot data and computing the test statistic TS jt for each metabolite j and each permuted data set t for T permutations.
- 4.Estimate the FDR for each permuted data set t = 1,…, T:
Consider the corresponding p-vector of the test statistics for all metabolites on permutation t.
Randomly sample p o = m × p of the test statistics and add to their intensities. This allows p o metabolites to be truly significant. Here, δ is the effect size, and ϱ jt is the true within group standard deviation estimated by .
A cut off point crit is set to be the largest absolute value of the test statistics . All metabolites with |TS jt | > crit are declared as significant. The FDR for permutation t can then be calculated.
Estimate the FDR for data simulation s by taking the 50 th percentile of the FDR values of 1,…, T permutations.
Repeat steps 2 to 5 for s = 1,…, SIM simulations and report the 10 th , 50 th and 90 th percentiles of the FDR values for sample size n k .
Repeat steps 2 to 6 for k = 1,…, ntry different sample sizes and select the optimal sample size as the n k with FDR = target.fdr.
The total number of permutations T used to estimate the sampling distribution of the test statistics TS was chosen to be twenty. In the samr R package  20 permutations were used to estimate the null distribution and they give accurate estimates of the FDR. Here, the value of the effect size δ is chosen based on the variance of the underlying model. The optimal sample size is estimated by predicting the sample size at target.fdr using a simple linear regression model on values of FDR above and below the target.fdr with their corresponding sample sizes n k . The estimated sample size by MetSizeR ensures that the power or the confidence level in statistical tests reaches (1-target.fdr).
Parameter specification: details and guidelines
The MetSizeR algorithm requires the specification of several parameters; some are parameters relevant to the intended analysis model, and some are parameters relevant to the sample size estimation procedure itself.
In terms of the MetSizeR GUI which has been developed, the user is requested to specify parameters specific to the sample size estimation procedure i.e. the number of bins in the NMR or MS spectrum, the expected proportion of significant bins, the target FDR and the minimum sample size they wish to be considered. The default settings of these parameters are 200 spectral bins, 20% significantly different bins, a target FDR of 5% and a minimum sample size of 4. The choice of the number and proportion of significantly different spectral bins will naturally be informed by the metabolomic practitioner’s knowledge, as will the minimum sample size choice. For the target FDR, again this depends on the conservatism of the metabolomic practitioner and/or the research question of interest, but a FDR of 5% is indicative of typical statistical practice. The user can easily re-run the MetSizeR algorithm for different settings of these parameters to ascertain the influence of their particular specifications. However, within the MetSizeR GUI the user has the option of requesting plots of the expected proportion of significant bins versus the FDR, over different sample sizes, giving insight to the influence of this particularly influential parameter on sample size estimation. Regarding the specification of parameters relevant to the intended analysis model, in the MetSizeR GUI, the user is only required to specify the intended analysis model (PPCA, PPCCA or DPPCA), and in the case of PPCCA, the number of covariates to be included. Both of these decisions are again practitioner informed, depending on the particular experiment under consideration. The MetSizeR manual, available through the developed MetSizeR GUI, guides the user through these parameter specification steps using a number of illustrative examples.
The remaining parameters in the MetSizeR algorithm have been fixed within the R code underlying the MetSizeR GUI, but given the open source nature of R, these can be changed by the user if desired. In the context of the PPCA model discussed above the hyperparameters of the prior distributions of the loadings matrix W and the variance σ2 are based on previous estimates of W and σ2 from applications of PPCA to metabolomic data (eg. [7, 8]). Each row of the loadings matrix W is simulated from a standard multivariate Gaussian distribution MVN and the noise variance σ2 is simulated from an inverse gamma distribution with shape parameter α1 = 3 and scale parameter α2 = 4. Hyperparameter settings for the PPCCA and DPPCA models are detailed in the Additional file 1. Within the MetSizeR algorithm four final parameters are specified: the effect size δ (fixed at 2.3, the 99th quantile of the assumed prior distribution of the loadings), the correction factor cf (fixed as the fifth percentile of the estimated standard errors of all metabolites), the number of permutations T (set to 20) and the number of simulations SIM (set at 20). These specifications are based on the choices in [3, 5, 14] in similar sample size estimation settings.
This section illustrates the application of MetSizeR to different metabolomic experimental settings. In the first section, MetSizeR is employed to estimate sample size in the setting where experimental pilot data are not available; the second section considers the case where experimental pilot data are available.
Sample size estimation using simulated pilot data
The expected proportion of significant spectral bins specified by the user impacts on the estimated number of samples required for the metabolomic experiment. Figures 1(B), 1(C) and 1(D) demonstrate the effect on FDR of varying the expected proportion of significant spectral bins for three different sample sizes. The figures show that, increasing the expected proportion of significant spectral bins reduces the FDR.
Figure 2(B) illustrates a third example of the setting where no experimental pilot data are available and the practitioner aims to conduct a longitudinal metabolomic experiment. The pilot data for this example are simulated from the DPPCA model; the data are simulated by only focusing on the first time point of the experiment as it is expected that the same number of subjects will be followed over time and that, while there may be dropouts, the largest number of subjects will be present at the first time point. Figure 2(B) shows that the expected number of samples required for a longitudinal study of 300 spectral bins with 20% significant bins and a target FDR of 5%, is 24 with 12 samples from each treatment group.
Sample size estimation with experimental pilot data
In a situation where experimental pilot data are available, parameter estimates used for simulations are based on fitting the underlying model to the experimental pilot data. Here, the application of MetSizeR is illustrated using real metabolomic data sets as experimental pilot data.
The approach developed here for sample size estimation is not limited to NMR data. The method has been developed to accept data from targeted metabolomic analysis using MS, thus ensuring its applicability across the metabolomics community. Setting MetSizeR specifications as in the previous examples, the PPCA model was fitted to a targeted metabolomic MS pilot data set and under the MetSizR algorithm, the estimated sample size is shown in Figure 3(B).
Discussion and conclusions
Determining sample sizes in metabolomics is important but due to the complexity of these experiments, there are currently no standard methods for sample size estimation in metabolomics. Moreover, since pilot studies are rarely done in metabolomics, sample size estimation approaches for high dimensional data studies requiring experimental pilot data, cannot be applied.
The method presented in this article is a straight forward approach for determining sample sizes for metabolomic experiments whilst controlling the FDR. The main advantage of the developed approach is its ability to determine sample size even when experimental pilot data are not available. Another key advantage is that it takes the type of analysis the researcher intends to use into consideration when estimating sample size and this can improve the power of the study. Also, since MetSizeR employs permutation techniques to estimate sample size, it accounts for correlation between metabolites and effect size variability. The method has been developed to accept both NMR and targeted MS data which will ensure wide applicability in the metabolomics community. Further, a software package facilitates easy implementation of the MetSizeR approach.
Areas of future work are multiple and varied. MetSizeR is currently designed to estimate the number of samples required for metabolomic experiments which involve two groups; modifications to the MetSizeR approach are possible to accommodate different metabolomic experimental designs. Alternatives to the permutation approach employed in MetSizeR could be examined - bootstrap sampling would provide an interesting alternative. Proof of concept metabolomic experiments are currently underway to validate the MetSizeR approach.
Availability and requirements
The package MetSizeR has been developed for the R statistical environment (http://www.r-project.org) and is freely available at http://cran.r-project.org. The package is accompanied by documentation files to facilitate its use.
Project name: MetSizeR
Project home page: http://cran.r-project.org/web/packages/MetSizeR/
Operating system(s): Platform independent.
Programming language: R platform.
Other requirements: No.
License: GPL (≥ 2)
Any restrictions to use: It is available for free download.
Nyamundanda Gift is a PhD candidate in the PhD in Bioinformatics and Systems Biology programme in University College Dublin. Dr. Isobel Claire Gormley is a lecturer in Statistics in the School of Mathematical Sciences, University College Dublin. Dr. Yue Fan is a Postdoctoral Researcher in the School of Biomolecular and Biomedical Science in University College Dublin. Prof. William Gallagher is an Associate Professor in the School Of Biomolecular and Biomedical Science in University College Dublin. Dr. Lorraine Brennan is a lecturer in Nutritional Biochemistry in the School of Agriculture and Food Science, Conway Institute, University College Dublin.
This research was supported by the Irish Research Council for Science Engineering and Technology (IRCSET) funded PhD Programme in Bioinformatics and Systems Biology in University College Dublin, Ireland ( http://bioinformatics.ucd.ie/PhD/) and by HRB Ireland (RP/2006/117).
- Berk M, Ebbels T, Montana G: A statistical framework for biomarker discovery in metabolomic time course data. Bioinformatics. 2011, 27 (14): 1979-1985. 10.1093/bioinformatics/btr289.PubMed CentralView ArticlePubMedGoogle Scholar
- Muller P, Parmigiani G, Robert C, Rousseau J: Optimal sample size for multiple testing. J Am Stat Assoc. 2004, 99 (468): 990-100. 10.1198/016214504000001646.View ArticleGoogle Scholar
- Tibshirani R: A simple method for assessing sample sizes in microarray experiments. BMC Bioinformatics. 2006, 7: 106-10.1186/1471-2105-7-106.PubMed CentralView ArticlePubMedGoogle Scholar
- Liu P, Hwanga JTG: Quick calculation for sample size while controlling false discovery rate with application to microarray analysis. Bioinformatics. 2007, 23 (6): 739-746. 10.1093/bioinformatics/btl664.View ArticlePubMedGoogle Scholar
- Lin WJ, Hsueh HM, Chen JJ: Power and sample size estimation in microarray studies. BMC Bioinformatics. 2010, 11: 48-10.1186/1471-2105-11-48.PubMed CentralView ArticlePubMedGoogle Scholar
- Tipping ME, Bishop CM: Probabilistic principal component analysis. J R Stat Soc: Series B (Stat Method). 1999, 61 (3): 611-622. 10.1111/1467-9868.00196.View ArticleGoogle Scholar
- Nyamundanda G, Gormley IC, Brennan L: Probabilistic principal component analysis for metabolomic data. BMC Bioinformatics. 2010, 11: 571-10.1186/1471-2105-11-571.PubMed CentralView ArticlePubMedGoogle Scholar
- Nyamundanda G, Gormley IC, Brennan L: A dynamic probabilistic principal components model for the analysis of longitudinal metabolomics data. J R Stat Soc Series C. (Appl Stat) (To Appear)
- R Development Core Team: R: a language and environment for statistical computing. 2009, Vienna, Austria: R Foundation for Statistical Computing, [http://www.R-project.org],Google Scholar
- Reo NV: Metabonomics based on NMR spectroscopy. Drug and Chem Toxicol. 2002, 25 (4): 375-382. 10.1081/DCT-120014789.View ArticleGoogle Scholar
- Dettmer K, Aronov PA, Hammock BD: Mass spectrometry-based metabolomics. Mass Spectrometry Rev. 2007, 26: 51-78. 10.1002/mas.20108.View ArticleGoogle Scholar
- Patti GJ, Yanes O, Siuzdak G: Innovation: Metabolomics: the apogee of the omics trilogy. Nat Rev Mol Cell Biol. 2012, 13 (4): 263-269. 10.1038/nrm3314.PubMed CentralView ArticlePubMedGoogle Scholar
- Benjamini Y, Hochberg Y: Controlling false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc. 1995, 57: 289-300.Google Scholar
- Hwang D, Schmitt WA, Stephanopoulos G, Stephanopoulos G: Determination of minimum sample size and discriminatory expression patterns in microarray data. Bioinformatics. 2002, 18: 1184-1193. 10.1093/bioinformatics/18.9.1184.View ArticlePubMedGoogle Scholar
- Carmody S, Brennan L: Effects of pentylenetetrazole-induced seizures on metabolomic profiles of rat brain. Neurochem Int. 2010, 56 (2): 340-344. 10.1016/j.neuint.2009.11.004.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.