 Research article
 Open access
 Published:
Multipletesting correction in metabolomewide association studies
BMC Bioinformatics volume 22, Article number: 67 (2021)
Abstract
Background
The search for statistically significant relationships between molecular markers and outcomes is challenging when dealing with highdimensional, noisy and collinear multivariate omics data, such as metabolomic profiles. Permutation procedures allow for the estimation of adjusted significance levels without assuming independence among metabolomic variables. Nevertheless, the complex nonnormal structure of metabolic profiles and outcomes may bias the permutation results leading to overly conservative threshold estimates i.e. lower than those from a Bonferroni or Sidak correction.
Methods
Within a univariate permutation procedure we employ parametric simulation methods based on the multivariate (log)Normal distribution to obtain adjusted significance levels which are consistent across different outcomes while effectively controlling the type I error rate. Next, we derive an alternative closedform expression for the estimation of the number of nonredundant metabolic variates based on the spectral decomposition of their correlation matrix. The performance of the method is tested for different model parametrizations and across a wide range of correlation levels of the variates using synthetic and real data sets.
Results
Both the permutationbased formulation and the more practical closed form expression are found to give an effective indication of the number of independent metabolic effects exhibited by the system, while guaranteeing that the derived adjusted threshold is stable across outcome measures with diverse properties.
Background
In omics studies many hundreds to tens of thousands of molecular variables are collected for each individual, leading to highdimensional multivariate data which are highly collinear. When analysing these data, many hypothesis tests are conducted simultaneously, thus effective methods to adjust for multiple testing are a central topic, especially in the context of MetabolomeWide Association Studies (MWAS) [1]. The aim is the detection of statistically significant relationships between molecular concentrations and disease outcomes while minimising the risk of false positive associations. A widely used approach for multiple testing is the false discovery rate (FDR) [2] which controls the expected proportion of falsely rejected hypotheses among all those rejected. This approach is effective in the case of independent or positive dependent tests. While there have been some attempts to deal with correlated tests such as [3] that proposed a simple but highly conservative procedure, in general correlation among tests is still a problem for FDR methods. Besides FDR corrections, familywise error rate (FWER) procedures control the probability of making at least one false conclusion (i.e. at least one Type I error). The FWER provides a more stringent control of Type I error compared to the FDR. Nevertheless, conventional FWER methods such as the Bonferroni [4] or Sidak [5] adjustment are known to be overly conservative when the tests are correlated. On the other hand, resamplingbased methods such as the permutation test are a standard tool to simultaneously assess the association of different correlated molecular quantities with an outcome of interest. These procedures can be conducted in both a parametric or nonparametric fashion. Parametric approaches are the preferred methods as they have relatively high power if the assumptions (e.g. normality of the data) hold. Nevertheless, in the context of MWAS the metabolic profiles are very rarely normally distributed nor present a symmetric distribution, and this may bias the result of the chosen significance test.
Thus, a first aim of this study is to overcome this issue and derive a valid yet stable metabolomewide significance level (MWSL) across outcomes with diverse distributional properties. The proposed approach is based on a permutation procedure built from parametric approximation methods via the multivariate Normal and logNormal distributions to describe the set of metabolic profiles while retaining their complex correlated structure up to the 2nd order moments, while effectively controlling the expected overall type I error rate at the \(\alpha\) level. While the proposed resamplingbased method is accurate and asymptotically consistent it demands intensive computation. In the context of genomic studies there have been several attempts to formulate the problem in terms of estimating the number of nonredundant molecular quantities as a closedform eigenvaluebased measure from the spectral decomposition of the empirical correlation matrix of the molecular variables. The available measures proposed by [6,7,8,9], and [10] are found not to be sufficiently accurate as a valid substitute for the proposed permutation procedure. Therefore, a second aim of this study is to derive a permutationfree closedform estimate of the MWSL to express the number of nonredundant molecular quantities. Both the permutationbased MWSL formulation and the more practical closed form expression are tested on synthetic and real data.
Methods
Permutationbased MWSL estimation
Permutation algorithm
Suppose the data consists of n observations, and let Y be the outcome of interest, \(X=(X_{1},\ldots ,X_{M})^T\) the vector of M predictors or features, and \(Z=(Z_{1},\ldots ,Z_{P})^T\) the vector of P fixed effect covariates. The permutationbased MWSL estimation can be described as follows.

Step (1): Shuffle i.e. resample without replacement, the outcome variable Y together with the set of fixed effects confounders Z if any. In this way, the n subjects are resampled under the null hypothesis of no association.

Step (2): To estimate the relationship between the outcome and the set of features while accounting for possible confounding effects, compute M regression models in a univariate approach, that is by using one feature at a time. From each model store the p value associated with the feature of interest. When appropriate, approaches other than regression methods can be used for testing of association e.g. correlation or ttest.

Step (3): Extract the minimum of the set of M p values as this indicates the highest threshold value which would reject all M null hypotheses.

Step (4): Repeat Step (1)–(3) for K times, where K is at least n/2 times [11]. The K minimum p values are the elements of the new vector q.

Step (5): Sort the elements of q, and take the (\(\alpha K\))value of this vector. This value is the MWSL estimate. An approximate confidence interval can be obtained by treating the true position of the MWSL estimate as a Binomial random variable with parameters K and \(\alpha\). Then, using the Normal approximation to the Binomial, we obtain the \(z_{(1\alpha )\%}\) confidence limits by extracting the elements of q in positions \((\alpha K)\pm \{ (1\alpha ) \sqrt{\alpha K (1\alpha )} \}\).

Step (6): Compute the effective number of tests (ENT) defined as the number of independent tests that would be required to obtain the same significance level using the Bonferroni correction ENT = \(\frac{\alpha }{\text {MWSL}}\). The ENT estimate measures the extent that the M markers are nonredundant. Therefore, the ratio R = \(\frac{\text {ENT}}{M}\%\) of the effective and the actual number of tests (ANT or M) is a measure of the dependence among features, which is expected to be closer to 0% when highly correlated features are considered.
Permutationbased procedures have previously been applied in different studies e.g. by [12] to approximate the genomewide significance threshold for dense SNP and resequencing data, or by [13] for urinary metabolic profiles. Recently in the context of NMR metabolic profiling studies [14] employed the permutation algorithm to perform a series of MWAS for serum levels of glucose. Counterintuitively, ENT estimates greater than the ANT were found, with an R ratio for glucose over 400%. With the methodology proposed in this paper, we generalise the algorithm to a more flexible regression context compared to [13], while we provide a robust framework to avoid biased estimates as in [14].
Parametric simulation methods
The underlying assumption of the permutation procedure is that the p values are properly calibrated, that is, every metabolitespecific p value is uniformly distributed, i.e. p value\(_m\) \(\sim U(0,1)\) where \(m=1,\ldots ,M\), when the null hypothesis is true. Because the MWSL is the minimum p value across the metabolite specific tests, all it takes is one poorly calibrated test with an erroneous small p value to bias the MWSL estimation. Very often in metabolomics studies the features are not normally distributed. Nevertheless, normality only matters sometimes. It matters when both the feature and the outcome have a skew distribution [15], while it has very little effect when either the feature or the outcome is normally distributed. In this context, we investigate the properties of the permutation approach for significance level estimation by employing the multivariate Normal and the multivariate logNormal distributions to describe, at least approximately, the set of correlated features and to obtain stable estimates of the MWSL while effectively controlling the maximum overall type I error rate at the \(\alpha\) level. We assume that the data are already centred so that the means equal zero. Therefore, \(X \sim {{\mathcal {N}}}_{M}(\mu , \Sigma ^*)\) is the multivariate Normal distribution employed to simulate the set of features where \(\mu = \text {E}[X]=(\text {E}[X_{1}],\ldots ,\text {E}[X_{M}])^T={\mathbf{0 }}\) is the Mdimensional mean vector of zero means, and \(\Sigma ^*\) is the \((M \times M)\) shrinkage estimator of the covariance matrix as described by [16]. The shrinkage estimator is always positive definite, wellconditioned, more efficient and therefore preferred to the unbiased estimator \(\Sigma\), or to the related maximum likelihood estimator \(\Sigma _{\text {ML}}\). Where the probability density of a feature is right skewed, we use the multivariate logNormal approximation. In such cases, the features are first transformed i.e. the absolute value of their minimum, plus one unit, is added to their original value. The algorithm is applied to realdata and simulated scenarios to illustrate the results for different model parametrizations and various type of outcome, as well as to investigate different correlation levels across features and between the features and the outcome.
Practical approximation of the ENT
The empirical method of computing the permutation test p value is hampered by the fact that a very large number of permutations are required to correctly estimate small, and therefore interesting p values. Thus, we now present a more efficient alternative to a standard permutation test to derive the MWSL. To distinguish from the effective number of nonredundant variates from the permutation procedure which has been defined as ENT in Section “Methods”, here we refer to the estimate from this practical approximation approach as Meff. It has previously been shown that the collective correlation among a set of variables can be measured by the variance of the eigenvalues (\(\lambda\)s) derived from a correlation matrix [6, 17]. In particular, high correlation among variables leads to high \(\lambda\)s, that is, when all variables are completely correlated, the first \(\lambda\) equals the number of variables in the correlation matrix (M) and the rest of the \(\lambda\)s are zero. Vice versa in the case of no correlation among variables, all the \(\lambda\)s will be equal to 1 with zero variance. Hence, the variance of the \(\lambda\)s will range between zero, and M. Based on this concept, within the genomics field several methods have been proposed for estimating the ENT from the correlations between variates. Among the first, [7] proposed to use the variance of the \(\lambda\)s to estimate the ENT for the limiting cases of none/ fully correlated variables, and a ratio of the eigenvalue’s variance to its maximum M for intermediate situations. [8] suggested summing the \(\lambda\)s, after substituting 1 for the \(\lambda\)s that are greater than 1. [9] suggested defining ENT as the number of \(\lambda\)s which can explain a certain percentage of the variation within the data. However, it is unclear how the percentage should be chosen as an excessively large or small value would result in an FWER that is overly conservative or liberal. [10] proposed a measure of ENT based on a \(\lambda\)s ratio function. In the context of our analyses, the Meff measures proposed by these authors were not sufficiently accurate as a valid substitute for the permutation procedure, hence we propose an empirical closedform expression directly related to the correlation among metabolomics variates as follows
This formulation balances the information from the \(\lambda _m\) with \(m=1,\ldots ,M\) estimated eigenvalues from the correlation matrix of the metabolite concentrations, with the contribution of the first eigenvalue \(\lambda _1\) which measures the primary cluster in the matrix, its number of variables, and the average correlation among the features [18]. This formulation is of interest in the context of correlated variates, that is when at least two variates are dependent, i.e. for \(\lambda _1 >1\), and therefore \(\log (\lambda _1) > 0\).
Next, the MWSL can be derived based on the following.

Step (1): Compute the \(\text {Meff}_{\text {MWSL}}\) with the proposed formulation.

Step (2): The MWSL can be derived based on a Bonferroni correction i.e. MWSL = \(\frac{\alpha }{\text {Meff}_{\text {MWSL}}}\) .
Alternatively, the full algorithm as an alternative to the permutation procedure can be described as follows.

Step (1): Compute the \(\text {Meff}_{\text {MWSL}}\) with the proposed formulation.

Step (2): Under the null hypothesis the p value of each metabolite follows a Uniform distribution, i.e. p value\(_m\) \(\sim U(0,1)\), where \(m=1,\ldots ,M\). The distribution of minimum p values can be approximated by the minimum order statistics (r=1), that is \(U_{(1)} \sim\) Beta(1, M) in the case of not correlated molecular variates, and Beta\((1,M')\) with \(M' \leqslant M\) in the case of correlated features. The limit case of very highly correlated features with \(M'=1 (<<M)\) reduces to sampling from a Beta(1, 1) which equals a U(0, 1). It follows that the \(\text {Meff}_{\text {MWSL}}\) can be used to approximate the distribution of minimum p values by sampling from a Beta\((1,\text {Meff}_{\text {MWSL}})\).

Step (3): The MWSL and its respective \(z_{(1\alpha )}\)% confidence limits can be derived as described in Section “Permutation algorithm”, Step (5)–(6) of the permutation procedure.
Results
Study of experimental metabolomics data
The MWAS approach was employed to investigate the association between human serum \(^{1}\)H NMR metabolic profiles and various clinical outcomes in the MultiEthnic Study of Atherosclerosis (MESA) [19]. The data have been extensively described in [14]. Briefly, the cohort includes participants (51% females, 49% males), aged 4484 years, (mean=63 years) from four different ethnic groups: ChineseAmerican, AfricanAmerican, Hispanic, and Caucasian, all recruited between 2000 and 2002 at clinical centres in the United States and free of symptomatic cardiovascular disease at baseline. Demographic, medical history, anthropometric, and lifestyle data, as well as serum samples were collected, together with information on diabetes, and lipid and blood pressure treatment. Metabolic profiles were obtained using \(^{1}\)H NMR at 600 MHz and processed as detailed in [20]. The outcomes of interest are glucose concentrations and the body mass index (BMI). Table 1 presents the descriptive statistics for the clinical outcome measures, while Table 2 reports the descriptive statistics for the fixed effects covariates used in the study. Three sets of NMR spectra are considered: (1) a standard watersuppressed onedimensional spectrum (NOESY), (2) a CarrPurcellMeiboomGill spectrum (CPMG), and (3) a lower resolution version of the CPMG data (BINNED^{Footnote 1}). The BINNED version consists of M=655 features, while the NOESY and CPMG contain M=30,590 features. The BINNED data sample comprises of n=3,500 individuals, while the NOESY and CPMG data have n=3,867 individuals. All MWSL calculations are performed for \(\alpha =0.05\).
From the conventional permutation procedure applied to the BINNED data shown in Fig. 1, when the real features are considered, there is instability in the estimation of the ENT across the different outcomes, and in particular the ENT estimate for glucose is above the ANT. When the data are simulated from a multivariate logNormal or Normal as described in Section “Parametric simulation methods” the ENT estimates are stable across the different outcomes and remain bounded below the total number of features with an average ENT around 350 and an R ratio around 50%. To assess the validity of this result in terms of redundancy of the set of features we considered principal component analysis (PCA) as an alternative method for estimating the ENT [6,7,8]. The cumulative proportion of variance explained by the first 350 PCs is around 99%. This is consistent with the interpretation that there are effectively 350 uncorrelated features in the data.
Figure 2 reports the ENT estimates for CPMG data. Without any transformations applied, there is a very large variation across the ENT estimates for the different outcomes, and in particular a very high and meaningless estimate for glucose levels which goes beyond R = 400%. On the other hand, when the set of features is simulated from the multivariate Normal and from the multivariate logNormal distribution the corresponding ENT estimate is below the total number of features, and stable across different outcomes with an average ENT of around 16,000 features and an R ratio around 50%. In this case the usefulness of the proposed permutation method to estimate the ENT is clear since the PCAbased ENT estimate would be constrained to the maximum number of PCs (\(n=3866\) i.e. max no. PC is n1).
Figure 3 reports the ENT estimates for the NOESY data which are below R=100% but vary across outcomes when the original set of features is considered. When simulated features from the multivariate (log) Normal distributions are considered we obtain lower ENT values than the ones from the CPMG data, with an average ENT of around 2700 features and an R ratio around 9%. This result was expected due to the reduced influence of broad signals in CPMG spectra compared to NOESY, which is linked to a weaker covariance structure. By applying a PCA to the NOESY data the cumulative proportion of variance explained by the first 2,700 PCs is around 99%, and this is in line with our findings.
Next, by exploiting the approximation method described in Section “Practical approximation of the ENT”, we derive the proposed \(\text {Meff}_{\text {MWSL}}\). Table 3 provides this estimate compared to the available alternative methods proposed by [7,8,9], and [10], and the ENT estimate from the permutation procedure which is the averaged estimate of the results obtained via the multivariate and logmultivariate Normal transformations. Considering the complexity of the eigenvalue structure in cases of very large data sets, the proposed \(\text {Meff}_{\text {MWSL}}\) in most cases seems to be able to consistently quantify, at least approximately, the correlation structure of the metabolomic variates. Based on this Meff estimate, to derive the MWSL and its confidence limits we simulate from a Beta(1,\(\text {Meff}_{\text {MWSL}}\)) which lets us obtain MWSL estimate of the same order of magnitude as those from the permutation procedure as shown in Table 4.
Simulation study
We now broaden the investigation by considering various correlation levels of the set of molecular variates as well as cases of correlation between the outcome and the variates. At first we generate various sets of variates, each of these with a specific and well bounded correlation level. This is performed following the algorithm described in Section “Parametric algorithm to generate synthetic variates”. Specifically, we generated nine sets of variates covering the whole range of positive correlation levels. Next, we generate outcomes both correlated and uncorrelated to the variates which we will employ within the permutation procedure to estimate the ENT across the various sets of correlated molecular variates. Uncorrelated outcomes of different shapes are easily simulated via parametric distributions such as the Normal distribution for a symmetric outcome, the SkewNormal distribution for a left skewed outcome, and a Weibull distribution for a right skewed outcome. Figure 4 shows the ENT estimates in the case of correlated variates and uncorrelated outcomes. Simulated correlated outcomes can be obtained as a linear combination of a few randomly chosen molecular variates with added noise, or via procedures based on Cholesky decomposition as is performed when simulating correlated features following the algorithm detailed in Section “Parametric algorithm to generate synthetic variates”. Figure 5 show the ENT estimates from the permutation procedure for the various sets of synthetic molecular variates and the correlated simulated outcomes. We conclude that correlation to the outcome makes no discernible difference to relationships between ENT and variatevariate correlation. Lastly, we apply the \(\text {Meff}_{\text {MWSL}}\) approximation to derive the results in Tables 5 and 6. The ENT from the permutation procedure is averaged from the results in Figs. 4 and 5. In this simulated environment, the \(\text {Meff}_{\text {MWSL}}\) approximation outperforms other available methods and describes well the permutationbased ENT estimates.
Validation of the approach
A type I error (falsepositive) occurs when a true null hypothesis is rejected. To check whether the permutation procedure accounts for the FWER at the \(\alpha\) level, for each metabolic variate and across the permutation replicates, we measure the type I error rate as the number of occurrences having a p value less or equal to the MWSL. Rather than the original realdata outcomes we test the multivariate (log)Normal permutation procedure to calculate the MWSL using various synthetic outcomes. In particular, we employ a continuous outcome from a Normal distribution, a discretebinary outcome from a Binomial distribution, a discretecount outcome from a Poisson distribution, and a timetoevent survival outcome from the Cox proportional hazards model as in [21]. We benchmark our result on the MESA BINNED data but also on a set of synthetic variates obtained via a nonparametric approach using PCA (see Section 6.2). We divide the data into test and nontest sets, compute a PCA model of the nontest data, and predict the test data based on this model. This approach allows us to generate synthetic data based on the structure of the real data without involving bootstrap/permutation methods [22] which we already employ to estimate the MWSL. Following the algorithm of Section 7.2 applied to the MESA BINNED data, we define the test and the nontest set by randomly sampling \(n_t=1,500\) and \(n_{{\bar{t}}}=35001500=2,000\) observations, respectively. From the PCA on the nontest set we select 350 PCs to be used to build the simulated test set of molecular variates \({{\hat{X}}}_t\). Table 7 and Table 8 confirm that the MWSL procedure effectively controls the FWER close to the (default) \(\alpha\)level of 5%.
Conclusions
In this paper we focus on assessing univariate test significance in multi collinear omics data by estimating a significance level threshold controlling the family wise error rate. The proposed procedure is based on an iterative permutation approach via univariate regression models while other measures of association may be used when appropriate. The molecular variates are simulated via parametric methods such as multivariate Normal and multivariate logNormal distributions to retain the correlation structure in the data, while controlling the false positive rate at the desired level. When the permutation procedure is applied to the approximated data the MWSL is stable across outcome measures with diverse properties.
In MWAS, the metabolic profiles often exhibit a high degree of collinearity, and this is supported by our finding that in all scenarios considered, when parametric methods are applied to approximate the structure of the data, the MWSL estimated through the permutation procedure is larger than the threshold obtained via a metabolomewide Bonferroni or Sidak corrections. Therefore, the corresponding ENT is always less than the actual number of tests as it mainly depends on the extent of correlation within the data. The extent of collinearity is summarized by the R ratio (%) of effective to actual number of tests. For the examples in this paper, R was found to be around 50% for the CPMG data (highresolution and BINNED version), and around 9% for the NOESY high resolution. This is consistent with the expected higher degree of correlations between spectral variables in the NOESY data. As with other approaches, the proposed closedform Meff approximation to the permutationbased ENT could be tentatively interpreted as the number of independent metabolic processes exhibited by the system. Both the MWSL or the Meff estimate can be employed downstream of the analysis to identify differentially regulated metabolites.
Availability of data and materials
MWSL is an opensource R software package avaiable at https://github.com/AlinaPeluso/PhenoMeNal. Within the package we made available the lower resolution CPMG data referred to in the text as MESA BINNED data. An R tutorial is available as a supplementary material.
Notes
Binning is a common data reduction approach in NMR metabolomics.
Abbreviations
 MWAS:

MetabolomeWide Association studies
 FDR:

false discovery rate
 FWER:

familywise error rate
 MWSL:

metabolomewide significance level
 ENT:

effective number of test
 ANT:

actual number of test
References
Holmes E, Loo RL, Stamler J, Bictash M, Yap IK, Chan Q, Ebbels T, De Iorio M, Brown IJ, Veselkov KA, et al. Human metabolic phenotype diversity and its association with diet and blood pressure. Nature. 2008;453(7193):396.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc: Ser B (Methodol). 1995;57(1):289–300.
Benjamini Y, Yekutieli D, et al. The control of the false discovery rate in multiple testing under dependency. Ann Stat. 2001;29(4):1165–88.
Bonferroni C. Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commericiali di Firenze. 1936;8:3–62.
Šidák Z. Rectangular confidence regions for the means of multivariate normal distributions. J Am Stat Assoc. 1967;62(318):626–33.
Cheverud JM. A simple correction for multiple comparisons in interval mapping genome scans. Heredity. 2001;87(1):52.
Nyholt DR. A simple correction for multiple testing for singlenucleotide polymorphisms in linkage disequilibrium with each other. Am J Human Genet. 2004;74(4):765–9.
Li J, Ji L. Adjusting multiple testing in multilocus analyses using the eigenvalues of a correlation matrix. Heredity. 2005;95(3):221.
Gao X, Starmer J, Martin ER. A multiple testing correction method for genetic association studies using correlated single nucleotide polymorphisms. Genet Epidemiol. 2008;32(4):361–9.
Galwey NW. A new measure of the effective number of tests, a practical tool for comparing families of nonindependent significance tests. Genetic Epidemiol. 2009;33(7):559–68.
Paparoditis E, Politis DN. The local bootstrap for periodogram statistics. J Time Ser Anal. 1999;20(2):193–222.
Hoggart CJ, Clark TG, De Iorio M, Whittaker JC, Balding DJ. Genomewide significance for dense snp and resequencing data. Genetic Epidemiol. 2008;32(2):179–85.
ChadeauHyam M, Ebbels TM, Brown IJ, Chan Q, Stamler J, Huang CC, Daviglus ML, Ueshima H, Zhao L, Holmes E, et al. Metabolic profiling and the metabolomewide association study: significance level for biomarker identification. J Proteome Res. 2010;9(9):4620–7.
Castagné R, Boulangé CL, Karaman I, Campanella G, Santos Ferreira DL, Kaluarachchi MR, Lehne B, Moayyeri A, Lewis MR, Spagou K, et al. Improving visualization and interpretation of metabolomewide association studies: An application in a populationbased cohort using untargeted 1HNMR metabolic profiling. J Proteome Res. 2017;16(10):3623–33.
Box GE, Watson GS. Robustness to nonnormality of regression tests. Biometrika. 1962;49(1–2):93–106.
Schäfer J, Strimmer K. A shrinkage approach to largescale covariance matrix estimation and implications for functional genomics. Statistical applications in genetics and molecular biology. 2005;4(1).
Cheverud JM, Rutledge J, Atchley WR. Quantitative genetics of development: genetic correlations among agespecific trait values and the evolution of ontogeny. Evolution. 1983;37(5):895–905.
Friedman S, Weisberg HF. Interpreting the first eigenvalue of a correlation matrix. Educ Psychol Measur. 1981;41(1):11–21.
Bild DE, Bluemke DA, Burke GL, Detrano R, Diez Roux AV, Folsom AR, Greenland P, JacobsJr DR, Kronmal R, Liu K, et al. Multiethnic study of atherosclerosis: objectives and design. Am J Epidemiol. 2002;156(9):871–81.
Karaman I, Ferreira DL, Boulangé CL, Kaluarachchi MR, Herrington D, Dona AC, Castagné R, Moayyeri A, Lehne B, Loh M, et al. Workflow for integrated processing of multicohort untargeted 1h nmr metabolomics data in largescale metabolic epidemiology. J Proteome Res. 2016;15(12):4188–94.
Bender R, Augustin T, Blettner M. Generating survival times to simulate cox proportional hazards models. Stat Med. 2005;24(11):1713–23.
Hastings WK. Monte carlo sampling methods using markov chains and their applications; 1970.
Higham NJ. Computing the nearest correlation matrix–a problem from finance. IMA J Numer Anal. 2002;22(3):329–43.
Martens H, Martens M. Multivariate analysis of quality. An introduction. Bristol: IOP Publishing; 2001.
Horizon2020 EC. PhenoMeNal (Phenome and Metabolome aNalysis): Largescale Computing for Medical Metabolomics (20152018). https://phenomenalh2020.eu/.
Acknowledgements
We thank Dr Marc ChadeauHyam and Dr Raphaele Castagne for useful discussions.
Funding
This work was undertaken as part of the PhenoMeNal project [25], European Commission grant EC654241. Robert C Glen and Timothy M D Ebbels were partially supported by the NIHR Imperial Biomedical Research Centre (BRC). The funding body did not play any role in the design of the study, collection, analysis, interpretation of data or writing the manuscript.
Author information
Authors and Affiliations
Contributions
AP performed all the analysis, and wrote the manuscript. AP produced the R software implementation. AP and TE conceived and planned the project. TE and RG supervised the whole work. RG provide the funding for analysis. All authors revised and approved the final manuscript.
Corresponding author
Ethics declarations
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Competing interests
Timothy M D Ebbels is a member of the editorial board. The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
. R tutorial MWSL.
Appendix
Appendix
Parametric algorithm to generate synthetic variates

Step (1): Generate a square \((M \times M)\) correlation matrix A assuming all variables have unit variance, i.e. the M elements on the diagonal are 1s. The \([M(M1)]/2\) elements of the upper triangular matrix are sampled from Uniform distributions bounded by a certain interval, e.g. high correlation level within the interval [0.75,0.85], medium correlation in [0.45,0.55], or low correlations in [0.25,0.35]. The lower triangle elements are copied from the upper triangle.

Step (2): As the \(\lambda\)s of A are required to be greater than zero, compute S as the nearest positive definite to the correlation matrix A achieving \(\{\min {{\left\ AS\right\ }_F: \, S \, \text {is a correlation matrix}}\}\), where \({\left\ A\right\ }^2_F =\sum _{i,j} a^2_{ij}\) as described by [23].

Step (3): Derive the lower triangular matrix L via Cholesky decomposition of matrix S such that S = LL’.

Step (4): M multivariate Normal features with zero means result from the product ZL between the \((n \times M)\) matrix Z of M random N(0,1) i.i.d. features, and the \((M \times M)\) lower triangular matrix L. The correlations of the simulated features are very close to those assigned in matrix A.
Nonparametric algorithm to generate synthetic variates

Step (1): By randomly sampling \(n_t\) observations from the original data matrix of variates X, construct the \((n_t\times M)\) test set of variates \(X_t\), and the \((n_{{\bar{t}}} \times M)\) nontest set \(X_{{\bar{t}}}\), with \(n_t<n\) and \(n_{{\bar{t}}}=nn_t\).

Step (2): Standardise the test and the nontest sets by subtracting their respective vectors of column means i.e. \(\mu _t\) and \(\mu _{{\bar{t}}}\), and dividing by their standard deviations i.e. \(\sigma _t\) and \(\sigma _{{\bar{t}}}\), to respectively obtain \(Z_t\) and \(Z_{{\bar{t}}}\).

Step (3): Compute PCA over the nontest set by applying singular value decomposition (SVD) such that \(Z_{{\bar{t}}}=U_{{\bar{t}}} \Sigma _{{\bar{t}}} V_{{\bar{t}}}^T\), where \(V_{{\bar{t}}}^T\) is the \((M \times M)\) matrix of loadings, while the PC scores are obtained as the product between the \((n_{{\bar{t}}} \times n_{{\bar{t}}})\) matrix \(U_{{\bar{t}}}\) of eigenvectors of \(Z_{{\bar{t}}} Z_{{\bar{t}}}^T\), and the \((n_{{\bar{t}}} \times M)\) diagonal matrix \(\Sigma _{{\bar{t}}}\).

Step (4): Use the nontest loadings \(V_{{\bar{t}}}\) combined with the test \(X_t\) to compute the \((n_t \times M)\) matrix \({{\hat{U}}}_t {\hat{\Sigma }}_t\) of PC predicted scores for the test set, i.e. \({{\hat{U}}}_t {\hat{\Sigma }}_t = X_t V_{{\bar{t}}}\).

Step (5): Build the \((n_t \times M)\) simulated test set of variates \(\hat{Z_t}\) as the product of the predicted scores from Step (4), and the matrix of loadings \(V_{{\bar{t}}}^T\) from Step (3) such that \({{\hat{Z}}}_t= {{\hat{U}}}_t {\hat{\Sigma }}_t V_{{\bar{t}}}^T\). We note that S PCs, with \(S \le M\), can be selected to be used in the predictions, thus \({{\hat{Z}}}_t\) would result from the product of the \((n_t \times S)\) matrix of PCs and the \((S \times M)\) matrix of loadings.

Step (6): From the simulated test set of standardised features \({{\hat{Z}}}_t\) compute the \((M \times M)\) set of simulated features as \({{\hat{X}}}_t = {{\hat{Z}}}_t \sigma _t + \mu _t\).
To simulate the set of variates in such way the sample size of the data should be large enough for the data to be split between the test and the nontest set, and no missing values are allowed. Nevertheless, a possible extension of this method would consider the Nonlinear Iterative Partial Least Squares (NIPALS) algorithm as a modified PCA to accommodate missing values [24].
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Peluso, A., Glen, R. & Ebbels, T.M.D. Multipletesting correction in metabolomewide association studies. BMC Bioinformatics 22, 67 (2021). https://doi.org/10.1186/s12859021039752
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859021039752