Multiple-testing correction in metabolome-wide association studies

The search for statistically significant relationships between molecular markers and outcomes is challenging when dealing with high-dimensional, noisy and collinear multivariate omics data, such as metabolomic profiles. Permutation procedures allow for the estimation of adjusted significance levels without assuming independence among metabolomic variables. Nevertheless, the complex non-normal structure of metabolic profiles and outcomes may bias the permutation results leading to overly conservative threshold estimates i.e. lower than those from a Bonferroni or Sidak correction. Within a univariate permutation procedure we employ parametric simulation methods based on the multivariate (log-)Normal distribution to obtain adjusted significance levels which are consistent across different outcomes while effectively controlling the type I error rate. Next, we derive an alternative closed-form expression for the estimation of the number of non-redundant metabolic variates based on the spectral decomposition of their correlation matrix. The performance of the method is tested for different model parametrizations and across a wide range of correlation levels of the variates using synthetic and real data sets. Both the permutation-based formulation and the more practical closed form expression are found to give an effective indication of the number of independent metabolic effects exhibited by the system, while guaranteeing that the derived adjusted threshold is stable across outcome measures with diverse properties.

rate (FDR) [2] which controls the expected proportion of falsely rejected hypotheses among all those rejected. This approach is effective in the case of independent or positive dependent tests. While there have been some attempts to deal with correlated tests such as [3] that proposed a simple but highly conservative procedure, in general correlation among tests is still a problem for FDR methods. Besides FDR corrections, family-wise error rate (FWER) procedures control the probability of making at least one false conclusion (i.e. at least one Type I error). The FWER provides a more stringent control of Type I error compared to the FDR. Nevertheless, conventional FWER methods such as the Bonferroni [4] or Sidak [5] adjustment are known to be overly conservative when the tests are correlated. On the other hand, resampling-based methods such as the permutation test are a standard tool to simultaneously assess the association of different correlated molecular quantities with an outcome of interest. These procedures can be conducted in both a parametric or non-parametric fashion. Parametric approaches are the preferred methods as they have relatively high power if the assumptions (e.g. normality of the data) hold. Nevertheless, in the context of MWAS the metabolic profiles are very rarely normally distributed nor present a symmetric distribution, and this may bias the result of the chosen significance test.
Thus, a first aim of this study is to overcome this issue and derive a valid yet stable metabolome-wide significance level (MWSL) across outcomes with diverse distributional properties. The proposed approach is based on a permutation procedure built from parametric approximation methods via the multivariate Normal and log-Normal distributions to describe the set of metabolic profiles while retaining their complex correlated structure up to the 2nd order moments, while effectively controlling the expected overall type I error rate at the α level. While the proposed re-sampling-based method is accurate and asymptotically consistent it demands intensive computation. In the context of genomic studies there have been several attempts to formulate the problem in terms of estimating the number of non-redundant molecular quantities as a closed-form eigenvalue-based measure from the spectral decomposition of the empirical correlation matrix of the molecular variables. The available measures proposed by [6][7][8][9], and [10] are found not to be sufficiently accurate as a valid substitute for the proposed permutation procedure. Therefore, a second aim of this study is to derive a permutation-free closed-form estimate of the MWSL to express the number of non-redundant molecular quantities. Both the permutationbased MWSL formulation and the more practical closed form expression are tested on synthetic and real data.

Permutation algorithm
Suppose the data consists of n observations, and let Y be the outcome of interest, X = (X 1 , . . . , X M ) T the vector of M predictors or features, and Z = (Z 1 , . . . , Z P ) T the vector of P fixed effect covariates. The permutation-based MWSL estimation can be described as follows.
Step (6): Compute the effective number of tests (ENT) defined as the number of independent tests that would be required to obtain the same significance level using the Bonferroni correction ENT = α MWSL . The ENT estimate measures the extent that the M markers are non-redundant. Therefore, the ratio R = ENT M % of the effective and the actual number of tests (ANT or M) is a measure of the dependence among features, which is expected to be closer to 0% when highly correlated features are considered.
Permutation-based procedures have previously been applied in different studies e.g. by [12] to approximate the genome-wide significance threshold for dense SNP and resequencing data, or by [13] for urinary metabolic profiles. Recently in the context of NMR metabolic profiling studies [14] employed the permutation algorithm to perform a series of MWAS for serum levels of glucose. Counterintuitively, ENT estimates greater than the ANT were found, with an R ratio for glucose over 400%. With the methodology proposed in this paper, we generalise the algorithm to a more flexible regression context compared to [13], while we provide a robust framework to avoid biased estimates as in [14].

Parametric simulation methods
The underlying assumption of the permutation procedure is that the p values are properly calibrated, that is, every metabolite-specific p value is uniformly distributed, i.e. p value m ∼ U (0, 1) where m = 1, . . . , M , when the null hypothesis is true. Because the MWSL is the minimum p value across the metabolite specific tests, all it takes is one poorly calibrated test with an erroneous small p value to bias the MWSL estimation. Very often in metabolomics studies the features are not normally distributed. Nevertheless, normality only matters sometimes. It matters when both the feature and the outcome have a skew distribution [15], while it has very little effect when either the feature or the outcome is normally distributed. In this context, we investigate the properties of the permutation approach for significance level estimation by employing the multivariate Normal and the multivariate log-Normal distributions to describe, at least approximately, the set of correlated features and to obtain stable estimates of the MWSL while effectively controlling the maximum overall type I error rate at the α level. We assume that the data are already centred so that the means equal zero. Therefore, X ∼ N M (µ, � * ) is the multivariate Normal distribution employed to simulate the set of features where µ = E ]) T = 0 is the M-dimensional mean vector of zero means, and * is the (M × M) shrinkage estimator of the covariance matrix as described by [16]. The shrinkage estimator is always positive definite, well-conditioned, more efficient and therefore preferred to the unbiased estimator , or to the related maximum likelihood estimator ML . Where the probability density of a feature is right skewed, we use the multivariate log-Normal approximation. In such cases, the features are first transformed i.e. the absolute value of their minimum, plus one unit, is added to their original value. The algorithm is applied to real-data and simulated scenarios to illustrate the results for different model parametrizations and various type of outcome, as well as to investigate different correlation levels across features and between the features and the outcome.

Practical approximation of the ENT
The empirical method of computing the permutation test p value is hampered by the fact that a very large number of permutations are required to correctly estimate small, and therefore interesting p values. Thus, we now present a more efficient alternative to a standard permutation test to derive the MWSL. To distinguish from the effective number of non-redundant variates from the permutation procedure which has been defined as ENT in Section "Methods", here we refer to the estimate from this practical approximation approach as Meff. It has previously been shown that the collective correlation among a set of variables can be measured by the variance of the eigenvalues ( s) derived from a correlation matrix [6,17]. In particular, high correlation among variables leads to high s, that is, when all variables are completely correlated, the first equals the number of variables in the correlation matrix (M) and the rest of the s are zero. Vice versa in the case of no correlation among variables, all the s will be equal to 1 with zero variance. Hence, the variance of the s will range between zero, and M. Based on this concept, within the genomics field several methods have been proposed for estimating the ENT from the correlations between variates. Among the first, [7] proposed to use the variance of the s to estimate the ENT for the limiting cases of none/ fully correlated variables, and a ratio of the eigenvalue's variance to its maximum M for intermediate situations. [8] suggested summing the s, after substituting 1 for the s that are greater than 1. [9] suggested defining ENT as the number of s which can explain a certain percentage of the variation within the data. However, it is unclear how the percentage should be chosen as an excessively large or small value would result in an FWER that is overly conservative or liberal. [10] proposed a measure of ENT based on a s ratio function. In the context of our analyses, the Meff measures proposed by these authors were not sufficiently accurate as a valid substitute for the permutation procedure, hence we propose an empirical closed-form expression directly related to the correlation among metabolomics variates as follows This formulation balances the information from the m with m = 1, . . . , M estimated eigenvalues from the correlation matrix of the metabolite concentrations, with the contribution of the first eigenvalue 1 which measures the primary cluster in the matrix, its number of variables, and the average correlation among the features [18]. This formulation is of interest in the context of correlated variates, that is when at least two variates are dependent, i.e. for 1 > 1 , and therefore log( 1 ) > 0.
Next, the MWSL can be derived based on the following. • Step (1): Compute the Meff MWSL with the proposed formulation. • Step (2) Alternatively, the full algorithm as an alternative to the permutation procedure can be described as follows. • Step (1) Step (3): The MWSL and its respective z (1−α) % confidence limits can be derived as described in Section "Permutation algorithm", Step (5)-(6) of the permutation procedure.

Study of experimental metabolomics data
The MWAS approach was employed to investigate the association between human serum 1 H NMR metabolic profiles and various clinical outcomes in the Multi-Ethnic Study of Atherosclerosis (MESA) [19]. The data have been extensively described in [14]. Briefly, the cohort includes participants (51% females, 49% males), aged 44-84 years, (mean=63 years) from four different ethnic groups: Chinese-American, African-American, Hispanic, and Caucasian, all recruited between 2000 and 2002 at clinical centres in the United States and free of symptomatic cardiovascular disease at baseline. Demographic, medical history, anthropometric, and lifestyle data, as well as serum samples were collected, together with information on diabetes, and lipid and blood pressure treatment. Metabolic profiles were obtained using 1 H NMR at 600 MHz and processed as detailed in [20]. The outcomes of interest are glucose concentrations and the body mass index (BMI). Table 1 presents the descriptive statistics for the clinical outcome  measures, while Table 2 reports the descriptive statistics for the fixed effects covariates used in the study. Three sets of NMR spectra are considered: (1) a standard water-suppressed one-dimensional spectrum (NOESY), (2) a Carr-Purcell-Meiboom-Gill spectrum (CPMG), and (3)  From the conventional permutation procedure applied to the BINNED data shown in Fig. 1, when the real features are considered, there is instability in the estimation of the ENT across the different outcomes, and in particular the ENT estimate for glucose is above the ANT. When the data are simulated from a multivariate log-Normal or Normal as described in Section "Parametric simulation methods" the ENT estimates are stable across the different outcomes and remain bounded below the total number of features  with an average ENT around 350 and an R ratio around 50%. To assess the validity of this result in terms of redundancy of the set of features we considered principal component analysis (PCA) as an alternative method for estimating the ENT [6][7][8]. The cumulative proportion of variance explained by the first 350 PCs is around 99%. This is consistent with the interpretation that there are effectively 350 uncorrelated features in the data. Figure 2 reports the ENT estimates for CPMG data. Without any transformations applied, there is a very large variation across the ENT estimates for the different outcomes, and in particular a very high and meaningless estimate for glucose levels which goes beyond R = 400%. On the other hand, when the set of features is simulated from the multivariate Normal and from the multivariate log-Normal distribution the corresponding ENT estimate is below the total number of features, and stable across different outcomes with an average ENT of around 16,000 features and an R ratio around 50%. In this case the usefulness of the proposed permutation method to estimate the ENT is clear since the PCA-based ENT estimate would be constrained to the maximum number of PCs ( n = 3866 i.e. max no. PC is n-1). Figure 3 reports the ENT estimates for the NOESY data which are below R=100% but vary across outcomes when the original set of features is considered. When simulated features from the multivariate (log-) Normal distributions are considered we obtain lower ENT values than the ones from the CPMG data, with an average ENT of around 2700 features and an R ratio around 9%. This result was expected due to the reduced influence of broad signals in CPMG spectra compared to NOESY, which is linked to a weaker covariance structure. By applying a PCA to the NOESY data the cumulative proportion of variance explained by the first 2,700 PCs is around 99%, and this is in line with our findings.
Next, by exploiting the approximation method described in Section "Practical approximation of the ENT", we derive the proposed Meff MWSL . Table 3 provides this estimate compared to the available alternative methods proposed by [7][8][9], and [10], and the ENT estimate from the permutation procedure which is the averaged estimate of the results obtained via the multivariate and log-multivariate Normal transformations. Considering the complexity of the eigenvalue structure in cases of very large data sets, the proposed Meff MWSL in most cases seems to be able to consistently quantify, at least approximately, the correlation structure of the metabolomic variates. Based on this Meff estimate, to derive the MWSL and its confidence limits we simulate from a Beta(1,Meff MWSL ) which lets us obtain MWSL estimate of the same order of magnitude as those from the permutation procedure as shown in Table 4.

Simulation study
We now broaden the investigation by considering various correlation levels of the set of molecular variates as well as cases of correlation between the outcome and the variates. At first we generate various sets of variates, each of these with a specific and well bounded correlation level. This is performed following the algorithm described in Section "Parametric algorithm to generate synthetic variates". Specifically, we generated nine sets of variates covering the whole range of positive correlation levels. Next, we generate outcomes both correlated and uncorrelated to the variates which we will employ within the permutation procedure to estimate the ENT across the various sets of correlated molecular variates. Uncorrelated outcomes of different shapes are easily simulated via parametric distributions such as the Normal distribution for a symmetric outcome, the Skew-Normal distribution for a left skewed outcome, and a Weibull distribution for a right skewed outcome. Figure 4 shows the ENT estimates in the case of correlated variates and uncorrelated outcomes. Simulated correlated outcomes can be obtained as a linear combination of a few randomly chosen molecular variates with added noise, or via procedures based on Cholesky decomposition as is performed when simulating correlated features following the algorithm detailed in Section "Parametric algorithm to generate synthetic variates". Figure 5 show the ENT estimates from the permutation procedure for the various sets of synthetic molecular variates and the correlated simulated outcomes. We conclude that correlation to the outcome makes no discernible difference to relationships between ENT and variate-variate correlation. Lastly, we apply the Meff MWSL approximation to derive the results in Tables 5 and 6. The ENT from the permutation procedure is averaged from the results in Figs. 4 and 5. In this simulated

Validation of the approach
A type I error (false-positive) occurs when a true null hypothesis is rejected. To check whether the permutation procedure accounts for the FWER at the α level, for each metabolic variate and across the permutation replicates, we measure the type I error rate as the number of occurrences having a p value less or equal to the MWSL. Rather than the  original real-data outcomes we test the multivariate (log-)Normal permutation procedure to calculate the MWSL using various synthetic outcomes. In particular, we employ   original data, multivariate Normal, multivariate log-Normal. Error bars represent 95% confidence limits. K=10,000 permutations a continuous outcome from a Normal distribution, a discrete-binary outcome from a Binomial distribution, a discrete-count outcome from a Poisson distribution, and a time-to-event survival outcome from the Cox proportional hazards model as in [21]. We benchmark our result on the MESA BINNED data but also on a set of synthetic variates obtained via a nonparametric approach using PCA (see Section 6.2). We divide the data into test and non-test sets, compute a PCA model of the non-test data, and predict the test data based on this model. This approach allows us to generate synthetic data based on the structure of the real data without involving bootstrap/permutation methods [22] which we already employ to estimate the MWSL. Following the algorithm of Section 7.2 applied to the MESA BINNED data, we define the test and the nontest set by randomly sampling n t = 1, 500 and n¯t = 3500 − 1500 = 2, 000 observations, respectively. From the PCA on the nontest set we select 350 PCs to be used to build the simulated test set of   Table 7 and Table 8 confirm that the MWSL procedure effectively controls the FWER close to the (default) α-level of 5%.

Conclusions
In this paper we focus on assessing univariate test significance in multi collinear omics data by estimating a significance level threshold controlling the family wise error rate. The proposed procedure is based on an iterative permutation approach via univariate regression models while other measures of association may be used when appropriate. The molecular variates are simulated via parametric methods such as multivariate Normal and multivariate log-Normal distributions to retain the correlation structure in the data, while controlling the false positive rate at the desired level. When the permutation procedure is applied to the approximated data the MWSL is stable across outcome measures with diverse properties.
In MWAS, the metabolic profiles often exhibit a high degree of collinearity, and this is supported by our finding that in all scenarios considered, when parametric methods are applied to approximate the structure of the data, the MWSL estimated through the permutation procedure is larger than the threshold obtained via a metabolomewide Bonferroni or Sidak corrections. Therefore, the corresponding ENT is always less than the actual number of tests as it mainly depends on the extent of correlation within the data. The extent of collinearity is summarized by the R ratio (%) of effective to actual number of tests. For the examples in this paper, R was found to be around 50% for the CPMG data (high-resolution and BINNED version), and around 9% for the NOESY high resolution. This is consistent with the expected higher degree of correlations between spectral variables in the NOESY data. As with other approaches, the proposed closed-form Meff approximation to the permutation-based ENT could be tentatively interpreted as the number of independent metabolic processes exhibited by the system. Both the MWSL or the Meff estimate can be employed downstream of the analysis to identify differentially regulated metabolites.
• Step (4): Use the nontest loadings V¯t combined with the test X t to compute the (n t × M) matrix Û tˆ t of PC predicted scores for the test set, i.e. Û tˆ t = X t V¯t. • Step (5): Build the (n t × M) simulated test set of variates Ẑ t as the product of the predicted scores from Step (4), and the matrix of loadings V T t from Step (3) such that Ẑ t =Û tˆ t V T t . We note that S PCs, with S ≤ M , can be selected to be used in the predictions, thus Ẑ t would result from the product of the (n t × S) matrix of PCs and the (S × M) matrix of loadings. • Step (6): From the simulated test set of standardised features Ẑ t compute the (M × M) set of simulated features as X t =Ẑ t σ t + µ t .
To simulate the set of variates in such way the sample size of the data should be large enough for the data to be split between the test and the nontest set, and no missing values are allowed. Nevertheless, a possible extension of this method would consider the Nonlinear Iterative Partial Least Squares (NIPALS) algorithm as a modified PCA to accommodate missing values [24].