Maximum rank reproducibility (MaRR)
Maximum Rank Reproducibility (MaRR) was proposed to assess reproducibility of gene ranks in replicate experiments by [26]. Since, MaRR is a non-parametric procedure, it does not assume any distributional parameters on the underlying structure of reproducible features. The core idea behind developing MaRR is based on a maximum rank statistic, such that, the procedure has the ability to detect the transition from reproducible to irreproducible signals by minimizing the mean squared error between the observed and theoretical survival function. The survival function S(t), is the probability that a subject survives longer than time t. For this reason, the Survival function is often regarded as complementary cumulative distribution function.
Using the MaRR procedure, we propose to examine the reproducibility of ranked lists from replicate experiments and assess how concordant the metabolites are ranked in replicate experiments. Any numeric value for a metabolite feature, such as abundance, test statistic, p-value, q-value or fold change score can be used to rank the features. Then the method utilizes the ranks and not the original measurements. Additional file 1: Fig. S1 illustrates an example dataset of rank statistics for \(M=10,000\) metabolites under the ideal setting of a perfect split (when the correlation between highly ranks signals is 1 and correlation between low ranked signals is 0).
Metabolites with reproducible measurements should be consistently highly ranked for both replicate experiments (blue, Additional file 1: Fig. S1), and are expected to have positive correlation in their ranks, whereas, metabolites with independent measures are assumed to have independent ranks and considered irreproducible (red, Additional file 1: Fig. S1).
MS-metabolomics data design
We introduce notation to describe MS-Metabolomics data resulting from a study design with D layers \((D \ge 1)\). The layers in a study design may be technical (e.g., batches, technical replicates) or biological (e.g., disease status of subjects), and all layers measure abundances of metabolites. Under each layer in a MS-Metabolomics study design, multiple replicate experiments are obtained. Under each layer d, \(d=1,2,\ldots ,D\), the reproducibility of these experiments are generally assessed across pairwise combinations of replicate experiments, i.e., \({n_d \atopwithdelims ()2}\) pairwise combinations, where \(n_d\) is the number of replicate experiments at layer d, where \(d=1\) indicates the top layer, \(d=2,\ldots ,{D-1}\) indicate the intermediate layers and \(d=D\) is the bottom layer in a study design.
In each replicate MS-Metabolomics experiment, a large number of unique metabolites with respect to mass-charge ratio and retention time are obtained. Each metabolite m examined is assumed to be associated with a continuous measurement from each of \(n_D\) replicate experiments by the layer D (bottom layer), where M is the total number of metabolites. Let \(x^{d}_{m,i}\) be the abundance measure of mth metabolite and the ith replicate experiment at layer d, such that, \(m=1,2,\ldots ,M\) and \(i=1,2,\ldots ,n_d\). We describe the structure and notation of these layers in the following sections. For notational simplicity, we will now use I as the total number of replicate experiments at the bottom layer.
Single layer
Let \(x^{1}_{m,i}\) be the measure of mth metabolite and the ith replicate experiment under a single layered study design. An example of this study design is a data set that has the abundance measurements of M metabolites and I biological subjects (the replicate experiments in this case). See Fig. 2c for an example of a single-layered study design.
Double layer
Let \(x^{\omega _1}_{m,i}\) be the measure of mth metabolite and the ith replicate experiment under a two-layered study design, i.e., \(\omega _1\) (top layer) and i (bottom layer- replicate experiment level), where \(\omega _1=1,\ldots ,n_{\omega _1}\) and \(i=1,\ldots ,I\), such that, \(n_{\omega _1}\) and I denote the total number of replicate experiments at the first (top) and bottom (second in this case) layers, respectively respectively respectively. An example of this study design is a data set that has abundance measurements of M metabolites and I replicate experiments under each of the \(n_{\omega _1}\) biological subjects, i.e., in total there are \(I \times n_{\omega _1}\) samples. See Fig. 2b for an example of a double-layered study design.
Triple layer
Let \(x^{\omega _1,\omega _2}_{m,i}\) be the measure of mth metabolite and the ith replicate experiment under a three-layered study design, i.e., \(\omega _1\) (top layer), \(\omega _2\) (middle layer) and i (bottom layer- replicate experiment level), where \(\omega _1=1,\ldots ,n_{\omega _1}\), \(\omega _2=1,\ldots ,n_{\omega _2}\) and \(i=1,\ldots ,I\), such that, \(n_{\omega _1}, n_{\omega _2} \ge 2\). \(n_{\omega _1}\), \(n_{\omega _2}\), and I denote the total number of replicate experiments at the first (top), second and bottom (third in this case) layers, respectively respectively. An example of this study design is a data set that has abundance measurements of M metabolites and I replicate experiments under two layered study design where the middle layer is the technical layer (e.g., different operators or runs) with \(n_{\omega _2}\) technical replicates and the top layer is subject layer with \(n_{\omega _1}\) biological subjects, i.e., in total there are \(I \times n_{\omega _1} \times n_{\omega _2}\) samples. See Fig. 2a for an example of a triple-layered study design.
N layer
Let \(x^{\omega _1,\omega _2,\ldots ,\omega _{N-1}}_{m,i}\) be the measure of mth metabolite and the ith replicate experiment under an N-layered study design, i.e., \(\omega _1\) (top layer), \(\omega _2\) (second layer), \(\ldots\), \(\omega _{N-1}\) (\((N-1)\) layer) and i (bottom layer- replicate experiment level), where \(\omega _1=1,\ldots ,n_{\omega _1}\), \(\omega _2=1,\ldots ,n_{\omega _2}\), \(\ldots\), \(\omega _{N-1}=1,\ldots ,n_{\omega _{N-1}}\) and \(i=1,\ldots ,I\), such that, \(n_{\omega _1}, n_{\omega _2},\ldots ,n_{\omega _{N-1}} \ge 2\). \(n_{\omega _1}\), \(n_{\omega _2}\), \(\ldots\), \(n_{\omega _{N-1}}\) and I denote the total number of replicate experiments at the first (top), second, \(\ldots\), \((N-1)\) and bottom (last) layers, respectively.
MaRR procedure for MS-metabolomics
For notational simplicity, we assume only the single layered study design to describe the MS-Metabolomics data sets in the context of the MaRR procedure as discussed in “Single layer section”. MaRR assumes no missing values (see below for data pre-processing steps regarding missing values). Even though, we have data for more than two replicate experiments, the MaRR application to MS-based metabolomics data focuses on pairwise replicate experiments. Thus, under the single layered study design, we implement MaRR on pairwise combinations of I replicate experiments, i.e., \({I \atopwithdelims ()2}\) combinations of pairwise replicate experiments, where \({\mathbf {x}}_{i}\) and \({\mathbf {x}}_{i'}\) are the replicate data sets, such that, \(i\ne i^{'}\), where \({\mathbf {x}}_{i}=(x_{1,i},\ldots ,x_{M,i})^{'}\). Moreover, \({\mathbf {X}}\) is a \(M \times I\) matrix, such that, \({\mathbf {X}}=({\mathbf {x}}_{1},\ldots ,{\mathbf {x}}_{I})\), where \({\mathbf {x}}_{i'}\) be the vector of abundances of M metabolites on the ith replicate experiment. These abundances are converted into rank statistics. Each metabolite is assigned a rank in each of the two replicate: \((R_{m,i},\,R_{m,i^{'}})\), where \(R_{m,i}\) is the rank among \(x_{1,i},\ldots ,x_{M,i}\) and likewise for \(R_{m,i^{'}}\). Now, the maximum rank statistic for the metabolite m can be defined as,
$$\begin{aligned} \hbox {Max}_{m}=\hbox {max}(R_{m,i},\,R_{m,i^{'}}), \end{aligned}$$
(1)
for \(m=1,\ldots ,M\).
Consider Additional file 1: Table S1 detailing a subset of a real data of \(M=6\) metabolites to describe the calculation of maximum rank statistics from a pair of replicate experiments. It is to be noted that metabolites that are highly ranked will have a relatively low value for their maximum rank statistic. On the other hand, low ranked metabolites will have higher values. Thus, choosing a threshold value based on the maximum rank can have the potential to separate reproducible from irreproducible signals.
Estimator of the proportion of reproducible metabolites
We define the proportion of reproducible metabolites between sample pairs as \(\pi _1\). Due to the first assumption of the MaRR procedure under ideal settings [26], \(\hbox {Max}_{y}<\hbox {Max}_{z}\) for all reproducible metabolites y and irreproducible metabolites z. This implies that all metabolites y such that \(\hbox {Max}_{y}/M \le \pi _1\) are reproducible, and all metabolites z such that \(\hbox {Max}_{z}/M >\pi _1\) are irreproducible. Rank pairs and maximum rank statistics for a sample data set generated under the ideal assumptions with \(\pi _1 = 0.35\) are provided in Additional file 1: Fig. S1. MaRR uses the survival function in its estimator derivation of \(\pi _1\). Thus, the empirical survival function is given by,
$$\begin{aligned} {\hat{S}}_{M}(x)= \frac{1}{M}\sum _{y=1}^{M}{I(\hbox {Max}_{y}/M\ge x)}, \quad x \in (0,1). \end{aligned}$$
(2)
Let \(\pi _1 \in (0,1)\) be fixed, then according to the properties of MaRR under the ideal setting, the marginal limiting distribution of the random variable \(\hbox {Max}_{z}/M\) as \(M \rightarrow \infty\),
$$\begin{aligned} S_{\pi _1}(x)= {\left\{ \begin{array}{ll} 1 &{} x<\pi _1\\ 1-\frac{(x-\pi _1)^2}{(1-\pi _1)^2}, &{} \pi _1\le x \le 1. \\ 0 &{} 1<x \end{array}\right. } \end{aligned}$$
A weighted mean squared error between two functions for \(\lambda \in (0,1)\) is defined as:
$$\begin{aligned} MSE_{M}(\lambda )=(M-l_{\lambda })^{-1}\sum _{x=l_{\lambda }}^{M}{[{\hat{S}}_{M}(x/M) - (1-\lambda )S_{\lambda }(x/M)]^2}, \end{aligned}$$
(3)
where \(l_{\lambda }=\max _{l=1,\ldots ,M} (l: l/M \le \lambda )\). The minimizer of \(MSE_{M}(\lambda )\) is used to estimate \(\pi _1\). [26] showed the desirable performance of this estimator across a variety of scenarios.
We assume the following under the realistic setting:
-
Reproducible signals tend to be ranked higher than irreproducible signals, that is, \(P(R_{y,i} < R_{y,i^{'}})>1/2\) and \(P(R_{z,i} < R_{z,i^{'}})>1/2\) if metabolite y is reproducible and metabolite z is irreproducible for any replicate sample pair \((i,i^{'})\).
-
The correlation between the ranks of reproducible signals is nonnegative.
-
The two ranks per irreproducible metabolite are independent.
The important difference between assumptions that separates the first assumption of realistic and ideal settings is the lack of clear split between reproducible and irreproducible signals with respect to \(\hbox {Max}_{y}\). As a result, the estimator \(\hat{\pi _1}\) derived in [26] is consistent in the ideal case whereas conservatively biased in the realistic case. In realistic settings, reproducible signals \(\hbox {Max}_{y}/M\) have a positive probability of falling in the region \((\pi _1,1)\).
For computational convenience, the discrete and rescaled version of \(\pi _1\) was used, i.e., \({\hat{k}}\).
$$\begin{aligned} {\hat{k}}= \underset{{l=l_{\lambda }}}{\text {arg min}}[MSE_{M}(l/M)], \end{aligned}$$
(4)
where \(l_{\lambda }=\displaystyle \max _{\{l=1,\ldots ,M \}} (l: l/M \le \lambda )\).
In practice, \({\hat{k}}\) is a good estimate when reproducible signals begin transition to irreproducible signals. We chose the value of \(\lambda\) to be 0.9 on the basis of a large number of simulated datasets with varying degrees of effect size and proportion of reproducible signals. However, this assertion cannot yet be proven theoretically. Although, for certain datasets with small effect size (e.g., mean parameter), we might need to reduce the \(\lambda\) value [26]. In our MS-Metabolomics datasets, it was not required as the effect sizes were relatively large compared to the real RNA-seq datasets in [26].
Estimation of reproducible signal for metabolite m and replicate sample pair \((i,\,i')\)
To determine the set of reproducible metabolites, a critical value \({\hat{N}}\) was chosen, according to an error rate. All metabolites from a replicate sample pair experiment \((i,i^{'})\) will be assigned reproducible if \(\hbox {Max}_{y} \le {\hat{N}}\). This approach of declaring metabolites is defining a rejection as \((0,{\hat{N}})\), and rejecting the null hypothesis [irreproducibility for all signals with \(\hbox {Max}_{y}\) in the region \((0,{\hat{N}})\)] [27]. When an irreproducible metabolite was declared reproducible, Type I error (false discovery) was committed. Marginal false discovery rate (mFDR) is estimated based on a rejection region [28].
We assume z be the possible outcomes from simultaneous hypotheses [29], where U is the number of true null hypothesis that were correctly not rejected (true negatives), V is the number of false rejections (false positives), T is the number of hypotheses that were not rejected when they should have been (false negatives), and S is the number of correctly rejected hypotheses (true positives). Q is the total number of rejections made (rejected null hypotheses). The mFDR [28] is given by,
$$\begin{aligned} mFDR=\frac{E[V]}{E[Q]}. \end{aligned}$$
(5)
The above quantity is similar to the classical FDR as defined by [30]. To define mFDR estimate based on the MaRR procedure, the following notations are introduced:
$$\begin{aligned}&Q(l)=\sum _{y=1}^{M}I(\hbox {Max}_{y} \le l)= \text {Number of metabolites declared reproducible for critical region}\,(0,\,l).\\&V_k(l)=\text {Irreproducible metabolites declared reproducible with}\,k< \hbox {Max}_{y} \le l. \end{aligned}$$
By using the above notations, the mFDR estimate for using l as the threshold value for declaring reproducibility is given by,
$$\begin{aligned} m{\widehat{FDR}}(l)=\frac{E[V_{{\hat{k}}}(l)]}{Q(l)}. \end{aligned}$$
(6)
The denominator of the above expression can be directly computed from the data whereas the numerator needs to be calculated using the distribution of \(\hbox {Max}_{z}\) and is also dependent on \({\hat{k}}\) [26]. The final expression of the numerator is given by,
$$\begin{aligned} E[V_{{\hat{k}}}(l)]= \frac{(l-{\hat{k}})^2}{M-{\hat{k}}},\quad l={\hat{k}}+1,\ldots ,M. \end{aligned}$$
(7)
The mFDR estimate associated with any rejection region \((0,\,l)\) can then be defined as,
$$\begin{aligned} m{\widehat{FDR}}(l)=\frac{E[V_{{\hat{k}}}(l)]}{Q(l)}=\frac{(l-{\hat{k}})^2}{Q(l)(M-{\hat{k}})},\quad l={\hat{k}}+1,\ldots ,M. \end{aligned}$$
(8)
We summarize the MaRR procedure for set of m metabolites each with two metrics generated from sample pair experiments as below:
$$\begin{aligned} {\hat{k}}= \underset{{l=l_{\lambda }}}{\text {arg min}}[MSE_{M}(l/M)], \end{aligned}$$
(9)
The FDR is controlled at a nominal level \(\alpha\) if the threshold value \({\hat{N}}\) is chosen to be
$$\begin{aligned} {\hat{N}}=\displaystyle \max _{{\hat{k}} <l \le M} \{l:m{\widehat{FDR}}(l) \le \alpha \}, \end{aligned}$$
(10)
and the set of metabolites generated from replicate pair of experiments with maximum rank statistics less than or equal to \({\hat{N}}\) are declared reproducible. We detect the indices of the metabolites from replicate sample pair experiments which are declared reproducible. We define an indicator variable \(r_{{m,(i,i')}}=1(0)\) to be a reproducible (irreproducible) signal for metabolite m and replicate sample pair \((i,i')\) as below:
$$\begin{aligned} r_{{m,(i,i')}}= {\left\{ \begin{array}{ll} 1, &{} I(\hbox {Max}_{m} \le {\hat{N}})\\ 0 &{} {\textit{otherwise}}. \end{array}\right. } \end{aligned}$$
Evaluating reproducibility
We create a matrix of reproducible signals with M rows (total number of metabolites) and J columns (\(J={I \atopwithdelims ()2}\)) (replicate sample pairs \({I \atopwithdelims ()2}\)), where J is the total possible number of sample pairs of replicate experiments. We assign metabolite m to be reproducible if a certain percentage signals (\(100c_s\%\)) are reproducible for pairwise combinations of replicate experiments across all study designs, i.e., if
$$\begin{aligned} \frac{{\sum _{i<i'}{{{r_{m,{(i,i')}}}}}}}{J} >c_s, \end{aligned}$$
(11)
such that, \(c_s \in (0,1)\).
Similarly, we assign a sample pair \((i,i')\) for each study design \({\omega _1}\) to be reproducible if a certain percentage signals (\(100c_m\%\)) are reproducible across all metabolites, i.e., if
$$\begin{aligned} \frac{\sum _{m}{{r{_{m,(i,i')}}}}}{M}>c_m, \end{aligned}$$
(12)
such that, \(c_m \in (0,1)\). Figure 1 illustrates the schematic filtering approach of reproducible signal matrix by rows (metabolites) and columns (sample pairs).
Illustrative data sets
Chronic Obstructive Pulmonary Disease (COPD) is a major cause of morbidity and mortality in the United States. The multi-center Genetic Epidemiology of COPD (COPDGene) study was designed to study the underlying genetic factors of COPD, [31]. This study enrolled 10,263 individuals from 2008 to 2011 (Visit 1) who were aged 45–80 with \(\ge 10\) pack-year smoking history and no exacerbations for \(>30\) days. From 2013 to 2017, 6758 subjects returned for an in-person 5-year visit (Visit 2). Each in-person visit included spirometry before and after albuterol, quantitative CT imaging of the chest, and blood sampling.
Technical (Tech) data set
Three technicians performed all steps of sample prep for profiling of a single base human plasma sample collected from COPDGene visit 1 containing six spiked in control compounds at concentrations of 1X, 2X and 4X and two negative controls at 1X in all samples. Processing of the raw data was performed in Agilent’s Mass Hunter software.
The Tech data set is a triple-layered MS-Metabolomics experiment. Figure 2a shows the hierarchical structure of the Tech data set, (i) the top layer being the batch operators, (ii) the middle layer is the spike-ins under each batch operator, and (iii) the bottom layer is the technical replicates under each spike-in. Based on our generic notations in “Triple layer section”, we can write, \(n_{\omega _1}\), \(n_{\omega _2}\) and I be the total number of replicate experiments at the top (batch operator), middle (spike-in control) and bottom (technical replicate) layers, respectively. Additional file 1: Table S2(a) illustrates the design of the MS-Metabolomics experiment for a single operator of the Tech data set.
Biological–Technical (BioTech) data set
Fresh frozen plasma was collected at COPDGene Visit 1 from 131 subjects. These samples were analyzed using untargeted LC–MS (C18+ and HILIC+) metabolomics. The lipid fraction of the human plasma collected from current and former smokers was analyzed using Time of Flight (ToF) liquid chromatograph (LC) (Agilent 6210 Series) and a Quadrupole ToF mass spectrometer (Agilent 6520) which yielded combined data on 2999 metabolite features before data-preprocessing. Data are available at the Metabolomics Workbench with Study ID ST000601, and data processing is described in [32].
The BioTech data set (Fig. 2b) can be treated as a double-layered MS-Metabolomics experiment as described in “Double layer section”. We can write, \(n_{\omega _1}=131\) and \(I=3\) be the total number of replicate experiments at the top (biological subjects) and bottom (technical replicates), respectively. Additional file 1: Table S2(b) illustrates the design of the MS-Metabolomics experiment for the BioTech data set.
Biological (Bio) data set
Within COPDGene 1136 subjects participated in an ancillary study in which they provided fresh frozen plasma at Visit 2. The plasma was profiled using the Metabolon Global Metabolomics platform using an untargeted gas chromatography–mass spectrometry and liquid chromatography–mass spectrometry (GC–MS and LC–MS) based metabolomic quantification protocol as described previously [12, 33]. The platform reported 1392 features including 1064 annotated features. A data normalization step was performed to correct variation resulting from instrument inter-day tuning differences: metabolite intensities were divided by the metabolite run day median then multiplied by the overall metabolite median. Subjects with aggregate metabolite median z-scores greater than 3.5 standard deviation from the mean \((\hbox {N}=6)\) of the cohort were removed.
The Bio data set (Fig. 2c) is a single-layered MS-Metabolomics experiment (“Single layer section”). Note that, there are no technical replicates in this data set. We can write, \(x^{1}_{m,i}\) be the abundance measures of mth metabolite and the ith replicate experiment where \(i=1,\ldots ,I(=1130)\). \(I=1130\) denotes the total number of replicate experiments. Additional file 1: Table S2(c) illustrates the design of the MS-Metabolomics experiment for the Bio data set.
Data pre-processing
We processed the three data sets described in “Illustrative data sets section” using the MSPrep software [34]. The data sets used in this paper are all log transformed. The data pre-processing include three steps and they are as follows:
Filtering
For each of the data sets, metabolites with more than 20% missing values were removed [22, 35]. (a) Tech data set-Post filtering, \(M=860\) metabolites remain. (b) BioTech data set-Post filtering, \(M=2860\) metabolites remain. (c) Bio data set-Post filtering, \(M=995\) metabolites remain.
Missing value imputation techniques
For all data sets, we used BPCA to impute missing values, which implements a Bayesian version of the PCA (Principal Component Analysis) [36, 37]. It combines an Expectation Maximization (EM) approach for PCA with a Bayesian model and imputes the missing values. This R function is available in R/Bioconductor package pcaMethods. As a secondary analysis to compare imputation methods, we also used kNN and Random Forest (RF) imputations for the BioTech dataset. kNN imputation was originally developed for high-dimensional microarray gene expression data [38]. For each metabolite with missing values, this method finds the k (\(=5\)) nearest metabolites using Euclidean metric and missing values are imputed by averaging the nearest neighboring non-missing values. We applied the R package VIM for this imputation approach. Further, we used RF to impute the missing values. In many literatures, it has been observed that RF-based imputation outperformed other imputation methods for imputing metabolomics data [39,40,41]. We applied the R package missForest for RF-based imputation.
Normalization
Unwanted variation appears from various sources in metabolomics studies. Normalization is an important step for the downstream statistical analysis of metabolomics data. For all three data sets, we employ the msnormalize function with the quantile normalization options in MSPrep [34] to normalize the data while maintaining biological variation in the replicates. As a secondary analysis to compare normalization methods, we also used the median normalization option in MSPrep [34] for the BioTech data set.
Pooling the abundance measurements
To evaluate the top layer of the hierarchy in the MS-metabolomics data design, we need to sum the abundances in the lower layer levels. We illustrate the sum of abundance measurements approach to measure reproducibility at each layer of the hierarchical MS-Metabolomics data sets using the Tech data set. For within spike-ins, i.e., between technical replicates, we rank and compare metabolites from each replicate pair. To make comparisons between spike-ins using the same operator, we sum abundance measurements over three replicates per spike-in. Subsequently, we rank the total abundance measurements for each spike-in and apply MaRR to measure the reproducibility metrics (\(\pi _1\)s) pairs of spike-ins under each batch operator, i.e., nine pairs of spike-in across three batch operators. Further, to make comparisons between batch operators for the same biological subject, we sum abundance measurements over all replicates under three spike-in per batch operator. MaRR is then applied to these pooled data set. We also repeat the sum of abundance measurements approach in the BioTech data set. However, this approach is not required to perform in the Bio data set (single-layered MS-Metabolomics data set).
Dealing with ties
In the rare occurrence of ties, ties in the ranking method were treated by random assignment. For example, if there are 10 metabolites and 2 are tied for smallest values, then one of the tied metabolites will be randomly assigned to the 9th position and the other one to the 10th position.
Simulations
The main goal of the simulation settings is to generate data with known reproducible/irreproducible signals that mimics the characteristics of one of our example data sets (BioTech described in “Illustrative data sets section”) to examine the performance of MaRR procedure. For this purpose, we describe two sets of simulation studies under extensive simulation settings. Since the MaRR procedure is applied to pair of replicate experiments, and to simplify notation, we assume one layer. In these simulations, we evaluate the accuracy of estimates of the reproducible signals \(\pi _1\), FDR control and the power for detecting the true reproducible signals. In the simulations, we assume that the data are log transformed.
Settings for simulation I
In simulation I, we generate data from a Bivariate Normal distribution using the means, standard deviations and Pearson’s correlation coefficient between replicate pair experiments of the processed BioTech data set.
We first ran the MaRR procedure to identify which metabolites were reproducible or irreproducible (see Methods), then calculated summary statistics of the reproducible/irreproducible metabolites and used those to generate simulation data sets In summary, (i) the quartiles of the proportion of reproducible signals (\(\pi _1\)) ranged between 0.90 and 0.95 for 393 sample pairs and 2860 metabolites. (ii) The quartiles of the means of the reproducible abundance measures (\(\mu _R\)), for replicate experiments ranged between 3.89 and 4.13. (iii) The quartiles of the means of the irreproducible abundance measures (\(\mu _{IR}\)), for replicate experiments ranged between 3.19 and 3.21. (iv) The quartiles of the standard deviations of the reproducible abundance measures (\(\sigma _R\)) for replicate experiments ranged between 0.16 and 0.17. (v) The quartiles of the standard deviations of the irreproducible abundance measures (\(\sigma _{IR}\)) for replicate experiments ranged between 0.04 and 0.05. (vi) The quartiles of the Pearson’s correlation coefficient of the reproducible abundance measures (\(\rho _R\)) between replicate experiments ranged between 0.992 and 0.994. We summarized the quartile measurements of these parameters in Additional file 1: Table S3. In addition, we presented the boxplots showing the spread of these parameters in Additional file 1: Fig. S2.
We fix the true values of the following parameters because their quartile ranges were extremely narrow: mean of the irreproducible abundance measures for replicate experiments (\(\mu _{IR}=3.2\)), standard deviation of the reproducible abundance measures for replicate experiments (\(\sigma _R=0.17\)) and standard deviation of the irreproducible abundance measures for replicate experiments (\(\sigma _{IR}=0.05\)). For the other parameters, we chose values from the quartile ranges: the mean of the reproducible abundance measures (\(\mu _R\)), the proportion of reproducible signals (\(\pi _1\)), and the correlation between reproducible pairs of abundance measures(\(\rho _R\)).
Under realistic settings of the MaRR procedure, (i) we assume that test statistics with large values will be highly ranked and (ii) the correlation between irreproducible abundance measure is zero. The test statistics for reproducible signals are generated from the Bivariate Normal distribution as follows:
$$\begin{aligned} \left( \begin{array}{c} t_{m,1}\\ t_{m,2} \end{array}\right)\sim & {} {\mathcal {N}} \left[ \left( \begin{array}{c} \mu _R\\ \mu _R \end{array}\right) ,\left( \begin{array}{cc} \sigma _R^2 &{} \sigma _R^2 \rho _R\\ \sigma _R^2 \rho _R &{} \sigma _R^2 \end{array}\right) \right] . \end{aligned}$$
The test statistics for irreproducible signals are generated from the standard bivariate Normal distribution assuming that the two test statistics are independent:
$$\begin{aligned} \left( \begin{array}{c} t_{m,1}\\ t_{m,2} \end{array}\right)\sim & {} {\mathcal {N}} \left[ \left( \begin{array}{c} \mu _{IR}\\ \mu _{IR} \end{array}\right) ,\left( \begin{array}{cc} \sigma _{IR}^2 &{} 0\\ 0 &{} \sigma _{IR}^2 \end{array}\right) \right] . \end{aligned}$$
To make the situations more challenging and exhaustive, we include 24 settings by varying \(\mu _R\in \{3.89,4.01,4.13\}\), \(\rho _R\in \{0.45,0.99\}\), and \(\pi _1\in \{0.2,0.4,0.75,0.9\}\) based on the summary statistics described above For each setting, we simulate 1000 data sets of metabolite size \(M=2860\), which is the number of metabolites in the BioTech data set, and apply the MaRR procedure.
Settings for simulation II
The simulation design is kept similar to the original MaRR paper [26]. Thus, to test the performance of the MaRR under extensive simulation settings, we implement the simulations with extreme choices of the parameters (\(\pi _1\) and \(r_0\)), and do not assume normality for all values.
Each simulated data will consist of proportion of reproducible signals and a minimum correlation for lowest ranked signals as parameters. The largest values of \(t_{m,1}\) and \(t_{m,2}\) are assumed to be highly ranked metabolites. For each reproducible metabolite m, the first test statistic \(t_{m,1}\) is generated according to \(t_{m,1} \sim {\textit{Uniform}}(4,5)\). The second test statistic \(t_{m,2}\) is dependent in such a way that, the correlation between \(t_{m,1}\) and \(t_{m,2}\) is linearly dependent according to a Normal distribution given \(t_{m,1}\),
$$\begin{aligned} t_{m,2}| t_{m,1} \sim {\mathcal {N}}(t_{m,1},1-r_m^{2}), \end{aligned}$$
(13)
where \(r_m=\frac{1-r_0}{5-4}(t_{m,1}-4)+r_0\).
For the correlation structure, we assume that when \(t_{m,1}=5\), there is perfect correlation whereas, the minimum correlation for the lowest ranked reproducible signals (\(t_{m,1}=4\)) is \(r_0\).
As with Simulation I, the test statistics for irreproducible metabolites are generated from the bivariate Normal distribution assuming that the two test statistics are independent,
$$\begin{aligned} \left( \begin{array}{c} t_{m,1}\\ t_{m,2} \end{array}\right)\sim & {} {\mathcal {N}} \left[ \left( \begin{array}{c} \mu _{IR}\\ \mu _{IR} \end{array}\right) ,\left( \begin{array}{cc} \sigma _{IR}^2 &{} 0\\ 0 &{} \sigma _{IR}^2 \end{array}\right) \right] . \end{aligned}$$
We consider 12 parameter settings for this simulation design by varying the proportion of reproducible signals, \(\pi _1\,\in \,\{0.2,\,0.4,\,0.75,\,0.9\}\) and the minimum correlations \(r_0\,\in \,\{0.4,\,0.6,\,0.99\}\) as selected above. For each setting, we simulate 1000 data sets and apply the MaRR procedure. To imitate the MS-Metabolomics data more closely for each data set, the size of metabolites is chosen to be \(M=2860\), which is the number of metabolites for the BioTech data set.
For each simulated data set, we compute the empirical FDR based on MaRR independently by dividing the number of false positives by the total number of reproducible signals. Similarly, we also compute the empirical NDR by dividing the true negatives by the total number of true reproducible signals.
Settings for simulation III
This simulation design is kept exactly similar to the settings for Simulation I except the test statistics for irreproducible signals are generated from the standard bivariate t distribution with degrees of freedom 3 assuming that the two test statistics are independent. The Standard t-distribution with degrees of freedom has heavier tails compared to Standard Normal distribution.