In this paper, a kernel weight based outlierrobust volcano plot is developed for detecting differential metabolites. To reduce the family wise error rate when comparing the performance of the proposed method with existing differential metabolite identification techniques, the pvalues are adjusted using Bonferroni correction. The algorithm for outlierrobust volcano plot is given below.
Outlierrobust volcano plot (proposed)
We extend volcano plot by introducing a kernel weight function behind CVP. Classical volcano plot identifies differential metabolites using the ttest and foldchange (FC) methods, and plots log_{2} (foldchange) on the Xaxis against log_{10} (pvalue) from the ttest on the Yaxis. Because the tstatistic depends on mean and variance and foldchange depends on mean, CVP is heavily influenced by outliers. Therefore, we use the kernel weighted average and variance instead of the classical mean and variance in the tstatistic and foldchange functions, and also plot log_{
2
} foldchange on the Xaxis and log_{10} (pvalue) from the ttest on the Yaxis. We refer to this procedure as robust volcano plot (RVP).
Let X = (x_{
ij
}); i = 1, 2, …, p and j = 1, 2, …, n, be a metabolomics data matrix with p metabolites and n samples. The rows and columns of X represent the metabolites and samples, respectively. In metabolomics data analysis, differential metabolites are the metabolites that show different concentrations between two groups (disease and control) of samples in a metabolomics dataset. According to the control and disease groups, the dataset can be expressed as
$$ X=\left[\begin{array}{c}\overset{Control}{\overbrace{x_{11}\kern0.5em {x}_{12}\kern0.5em \cdots \kern0.5em {x}_{1{g}_1}}}\\ {}{x}_{21}\kern0.5em {x}_{22}\kern0.5em \cdots \kern0.5em {x}_{2{g}_1}\\ {}\begin{array}{cccc}\vdots \kern0.75em & \vdots \kern0.5em & \ddots & \vdots \end{array}\\ {}\begin{array}{cccc}{x}_{p1}& {x}_{p2}& \cdots & {x}_{pg_1}\end{array}\end{array}\kern0.5em \begin{array}{c}\overset{Disease}{\overbrace{\begin{array}{cccc}{x}_{1\left({g}_1+1\right)}& {x}_{1\left({g}_1+2\right)}& \cdots & {x}_{1n}\end{array}}}\\ {}\begin{array}{cccc}{x}_{2\left({g}_1+1\right)}& {x}_{2\left({g}_1+2\right)}& \cdots & {x}_{2n}\end{array}\\ {}\begin{array}{cccc}\vdots \kern1.5em & \vdots \kern1.5em & \ddots & \vdots \end{array}\\ {}\begin{array}{cccc}{x}_{p\left({g}_1+1\right)}& {x}_{p\left({g}_1+2\right)}& \cdots & {x}_{pn}\end{array}\end{array}\right], $$
where g_{1} is the number of subjects in the control group and (n − g_{1}) is the number of subjects in the disease group. In CVP, log_{
2
}(foldchange) and log_{10}(pvalue) from the ttest are calculated as follows.
The log_{
2
} (foldchange) value for the ith metabolite is
$$ {\log}_2\left({FC}_i\right)={\log}_2\left(\frac{{\overline{X}}_i^D}{{\overline{X}}_i^C}\right), $$
(1)
where \( {\overline{X}}_i^D \) represents the average intensity of the ith metabolite for the disease group and \( {\overline{X}}_i^C \) represents the average intensity of the ith metabolite for the control group.
The tstatistic for testing the hypothesis that the ith metabolite is not differential, i.e.
$$ {H}_0:\kern0.75em {\mu}_i^C={\mu}_i^D\kern1.5em \mathrm{against}\kern1.5em {H}_1:\kern1em {\mu}_i^C\ne {\mu}_i^D, $$
for σ_{
iC
}^{2} = σ_{
iD
}^{2} is
$$ t=\frac{{\overline{X}}_i^C{\overline{X}}_i^D}{\sqrt{S_i^2\left(\frac{1}{g_1}+\frac{1}{n{g}_1}\right)}} $$
(2)
where
$$ {\displaystyle \begin{array}{l}{\overline{X}}_i^C=\frac{\sum \limits_{j=1}^{g_1}{x}_{ij}}{g_1};\kern1.25em {\overline{X}}_i^D=\frac{\sum \limits_{j={g}_1+1}^n{x}_{ij}}{n{g}_1};\kern0.5em {S}_{iC}^2=\frac{1}{g_11}\sum \limits_{j=1}^{g_1}{\left({x}_{ij}{\overline{X}}_i^C\right)}^2\kern0.75em ;\kern0.5em {S}_{iD}^2=\frac{1}{n{g}_11}\sum \limits_{j={g}_1+1}^n{\left({x}_{ij}{\overline{X}}_i^D\right)}^2\\ {};{S}_i^2=\frac{\left({g}_11\right){S}_{iC}^2+\left(n{g}_11\right){S}_{iD}^2}{n2}.\end{array}} $$
The value from eq. (2) is compared with Student’s tvalue with n − 2 degrees of freedom.
If (σ_{
iC
}^{2} ≠ σ_{
iD
}^{2}) then the test statistic is
$$ t=\frac{{\overline{X}}_i^C{\overline{X}}_i^D}{\sqrt{\left(\frac{S_{iC}^2}{g_1}+\frac{S_{iD}^2}{n{g}_1}\right)}} $$
(3)
In both cases, the pvalue is calculated using
$$ p\mathrm{value}=\underset{t_{calculated}}{\overset{\infty }{\int }}f(t) dt $$
(4)
In CVP, the FC value from (1) and tvalue from (2) or (3) are calculated using the classical mean and variance. Because the classical mean and variance are influenced by outliers, we propose RVP using the weighted mean and variance instead of the classical mean and variance. For the weighted mean and variance, we use the kernel weight function \( {w}_j=\exp \left\{\frac{\lambda }{2{\left( mad\left({x}_j\right)\right)}^2}{\left({x}_{ij} median\left({x}_j\right)\right)}^2\right\}, \) where mad is the median absolute deviation. The value of this weight function lies between zero and one, and is close to zero if the observation is far from the median and close to one if the observation is close to the median. In the weight function, the tuning parameter λ is selected using kfold cross validation (Fig. 1 summarizes the λ selection procedure). If the dataset does not contain outliers, then the value of λ is zero and all the weights are equal to 1, so the method is the same as the classical approach. The steps for RVP are given below.

Step − 1. Calculate log_{
2
} (fold change) for the ith metabolite as \( {\log}_2\left({FC}_i\right)={\log}_2\left(\frac{{\overline{X}}_i^D}{{\overline{X}}_i^C}\right), \) where \( {\overline{X}}_i^D=\sum \limits_{j={g}_1+1}^n{w}_j{x}_{ij}/\left(n{g}_1\right) \) represents the weighted average intensity of the ith metabolite for the disease group and \( {\overline{X}}_i^C=\sum \limits_{j=1}^{g_1}{w}_j{x}_{ij}/{g}_1 \) represents the weighted average intensity of the ith metabolite for the control group.

Step − 2. Using the weighted average and weighted variance instead of the classical mean and variance, calculate log_{10}(pvalue) for the ith metabolite from the ttest using eqs. (2), (3) and (4), where \( {S}_{iC}^2=\sum \limits_{j=1}^{g_1}{w}_j{\left({x}_{ij}{\overline{X}}_i^C\right)}^2/\left({g}_11\right) \) and \( {S}_{iD}^2=\sum \limits_{j={g}_1+1}^n{w}_j{\left({x}_{ij}{\overline{X}}_i^D\right)}^2/\left(n{g}_11\right). \)

Step − 3. Draw a scatter plot with log_{2} (foldchange) on the Xaxis and log_{10} (pvalue) from the ttest on the Yaxis. This plot is considered to be an outlierrobust volcano plot (RVP). A metabolite is said to be differential if pvalue < 0.05 and  log_{2} (foldchange)  > 1.
The R package of the proposed method with its user manual is available at https://github.com/nishithkumarpaul/Rvolcano.
Any user can install the “Rvolcano” package in R platform from the GitHub using the following code
library(devtools)
install_github("Rvolcano","nishithkumarpaul")
library(Rvolcano)
To draw the robust volcano plot using the package, the user manual is available at GitHub website.
Dataset description
In this paper, we use an artificially generated dataset and an experimentally measured metabolomics dataset to evaluate the performance of the proposed method in comparison with nine other methods.
Artificial data
In this study, as in [6], we generate an artificial metabolomics dataset using a oneway ANOVA model y_{
ijk
} = μ_{
i
} + g_{
ij
} + ∈_{
ijk
}, where y_{
ijk
} is the intensity of the i^{th} metabolite, j^{th} group and k^{th} sample, μ_{
i
} denotes the overall intensity of metabolite i, g_{
ij
} is the j^{th} group effect for the i^{th} metabolite, and ∈_{
ijk
} is a random error term. In this linear model, μ_{
i
} ~ uniform (10, 20) and ∈_{
ijk
} ~ N(0,1). The disease and control group effects for increased concentrations of metabolites are g_{
ij
} = N(4, 1) and g_{
ij
} = N(2, 1), respectively; for decreased concentrations of metabolites, we use g_{
ij
} = N(2, 1) and g_{
ij
} = N(4, 1) for the disease and control groups, respectively. Both group effects for nondifferential (equal concentration) metabolites are g_{
ij
} = N(0, 1). To create the artificial metabolomics dataset, we designated 130 metabolites as nondifferential and 20 metabolites as differential (having differential concentrations). The dataset contained 70 subjects with 40 subjects in group1 and 30 subjects in group2. To investigate the performance of the proposed method under different conditions, outliers were randomly distributed in the artificially generated data matrix at different rates (5%, 10%, 15%, 20%, and 25%). Note that these outliers can fall anywhere in the data matrix. The outliers for the ith metabolite were taken from a normal distribution with mean 3*μ_{
i
} and variance \( {\sigma}_i^2 \), i.e. N (3*μ_{
i
}, \( {\sigma}_i^2 \)), where μ_{i} and \( {\sigma}_i^2 \) are the mean and variance of the ith metabolite. In total, 500 artificial datasets were generated for each condition, and the performance of the proposed method was evaluated using these datasets.
Experimentally measured data
In this paper, we considered a wellknown publicly available metabolomics dataset for breast cancer serum data and control serum data containing metabolite abundance level measurements from different subjects. This dataset is available from the National Institute of Health (NIH) data repository and was collected by the University of Hawaii Cancer Center under study ID ST000356. The data were measured using a gas chromatography with timeofflight mass spectrometry (GCTOFMS) instrument and quantified using the ChromaTOF software (v4.33, Leco Co, CA, USA). The dataset contains 134 subjects (103 breast cancer without any treatment and 31 control subjects) and 101 metabolites. Autoscaling was used to reduce the systematic variation in the dataset. To investigate the performance of the proposed method under different conditions, we also modified the dataset by changing 5%, 10%, and 15% of the real values by N(4 × μ_{
i
}, σ_{
i
}^{2}), where μ_{
i
} and σ_{
i
}^{2} are the mean and variance of the ith metabolite in the breast cancer data matrix.