Multiple-testing correction in metabolome-wide association studies

Background The search for statistically significant relationships between molecular markers and outcomes is challenging when dealing with high-dimensional, noisy and collinear multivariate omics data, such as metabolomic profiles. Permutation procedures allow for the estimation of adjusted significance levels without assuming independence among metabolomic variables. Nevertheless, the complex non-normal structure of metabolic profiles and outcomes may bias the permutation results leading to overly conservative threshold estimates i.e. lower than those from a Bonferroni or Sidak correction. Methods Within a univariate permutation procedure we employ parametric simulation methods based on the multivariate (log-)Normal distribution to obtain adjusted significance levels which are consistent across different outcomes while effectively controlling the type I error rate. Next, we derive an alternative closed-form expression for the estimation of the number of non-redundant metabolic variates based on the spectral decomposition of their correlation matrix. The performance of the method is tested for different model parametrizations and across a wide range of correlation levels of the variates using synthetic and real data sets. Results Both the permutation-based formulation and the more practical closed form expression are found to give an effective indication of the number of independent metabolic effects exhibited by the system, while guaranteeing that the derived adjusted threshold is stable across outcome measures with diverse properties.

Briefly, the cohort includes participants (51\% females, 49\% males), aged 44-84 years, (mean age of 63 years) from four different ethnic groups: Chinese-American, African-American, Hispanic, andCaucasian, all recruited between 2000-2002 at clinical centres in the United States and free of symptomatic cardiovascular disease at baseline. Demographic, medical history, anthropometric, and lifestyle data, as well as serum samples were collected, together with information on diabetes, and lipid and blood pressure treatment.

Arguments:
outcome a vector of n data point values of a continuous (both symmetric and skewed), discrete binary (0/1) or count outcome, or a data frame with n observations and column variables time (or time1 and time2 ) and status for a time-to-event survival outcome.
features a data frame of n observations (rows) and M features e.g. metabolic profiles (columns).
confounders an optional data frame of n observations (rows) and P fixed effects confounders (columns). Default to confounders = NULL .
methods an optional string which can take values 'identity' if no transformation is applied to the data, or 'mN' (or 'mlogN' ) when the set of features is simulated via a multivariate Normal (or multivariate log-Normal) distribution.
n.permutation an optional numeric value. Default to n.permutation =10,000.
alpha an optional probability value. Default to alpha =0.05.
verbose an optional logical value which allows output some status messages while computing. Default to verbose = TRUE .

Outputs:
matPvals the matrix of p-values for the M features (columns) and the n.permutation (rows). ENT_CI.low = lower value alpha \%-confidence interval ENT; t1err.percent the estimated type I error (\%).
Optimal performances in terms of computational time are achieved when the procedure runs on a multi-core computer as parallel computing is applied within the function to deal with the heaviest steps.
The empirical method of computing the permutation test p-value is hampered by the fact that a very large number of permutation is required to correctly estimate small, and therefore interesting p-values. Thus, the MWSL::Meff allows for an efficient approximation alternative. To distinguish from the effective number of non-redundant variates from the permutation procedure which has been defined as ENT , here we refer to the estimate from this practical approximation approach as Meff . The approximation is based on the spectral decomposition of the correlation matrix of the metabolomics variates, and allows for the comparison of the proposed estimate with methods of interest from the genomics field which proposed a similar formulation based on the same concept.

Arguments:
features a data frame of n observations (rows) and M features e.g. metabolic profiles (columns).  Once the appropriate analysis has been performed the metabolites of interest can be identified comparing the respective p-value to the the adjusted thresold (MWSL or Meff). When the raw p-value corresponding to a certain feature is smaller than the adjusted thresold we identify that metabolomic variate as significant. For the other metabolites we conclude that there is no association between the changes in the outcome variable and the shifts in these features.

Arguments:
outcome a vector of n data point values of a continuous (both symmetric and skewed), discrete binary (0/1) or count outcome, or a data frame with n observations and column variables time (or time1 and time2 ) and status for a time-to-event survival outcome.
features a data frame of n observations (rows) and M features e.g. metabolic profiles (columns).
confounders an optional data frame of n observations (rows) and P fixed effects confounders (columns). Default to confounders = NULL . vennPlot an optional logic value to be set to TRUE for the visualisation of the Venn-plot for the total count of differrentally regulated metabolomics variates when employing the Bonferroni, the Benjamini-Hochberg (FDR), and the MWSL (FWER) correction, respectively verbose an optional logic value to suppress some output status messages. Default to TRUE .
res.DE_names names of the of differrentally regulated metabolomics variates when employing the Bonferroni, the Section 4: Identification of outcome-specific differentially regulated metabolites Benjamini-Hochberg (FDR), and the MWSL (FWER) correction, respectively.