In this paper, a kernel weight based outlier-robust volcano plot is developed for detecting differential metabolites. To reduce the family wise error rate when comparing the performance of the proposed method with existing differential metabolite identification techniques, the p-values are adjusted using Bonferroni correction. The algorithm for outlier-robust volcano plot is given below.
Outlier-robust volcano plot (proposed)
We extend volcano plot by introducing a kernel weight function behind CVP. Classical volcano plot identifies differential metabolites using the t-test and fold-change (FC) methods, and plots log2 (fold-change) on the X-axis against -log10 (p-value) from the t-test on the Y-axis. Because the t-statistic depends on mean and variance and fold-change depends on mean, CVP is heavily influenced by outliers. Therefore, we use the kernel weighted average and variance instead of the classical mean and variance in the t-statistic and fold-change functions, and also plot log
2
fold-change on the X-axis and -log10 (p-value) from the t-test on the Y-axis. We refer to this procedure as robust volcano plot (RVP).
Let X = (x
ij
); i = 1, 2, …, p and j = 1, 2, …, n, be a metabolomics data matrix with p metabolites and n samples. The rows and columns of X represent the metabolites and samples, respectively. In metabolomics data analysis, differential metabolites are the metabolites that show different concentrations between two groups (disease and control) of samples in a metabolomics dataset. According to the control and disease groups, the dataset can be expressed as
$$ X=\left[\begin{array}{c}\overset{Control}{\overbrace{x_{11}\kern0.5em {x}_{12}\kern0.5em \cdots \kern0.5em {x}_{1{g}_1}}}\\ {}{x}_{21}\kern0.5em {x}_{22}\kern0.5em \cdots \kern0.5em {x}_{2{g}_1}\\ {}\begin{array}{cccc}\vdots \kern0.75em & \vdots \kern0.5em & \ddots & \vdots \end{array}\\ {}\begin{array}{cccc}{x}_{p1}& {x}_{p2}& \cdots & {x}_{pg_1}\end{array}\end{array}\kern0.5em \begin{array}{c}\overset{Disease}{\overbrace{\begin{array}{cccc}{x}_{1\left({g}_1+1\right)}& {x}_{1\left({g}_1+2\right)}& \cdots & {x}_{1n}\end{array}}}\\ {}\begin{array}{cccc}{x}_{2\left({g}_1+1\right)}& {x}_{2\left({g}_1+2\right)}& \cdots & {x}_{2n}\end{array}\\ {}\begin{array}{cccc}\vdots \kern1.5em & \vdots \kern1.5em & \ddots & \vdots \end{array}\\ {}\begin{array}{cccc}{x}_{p\left({g}_1+1\right)}& {x}_{p\left({g}_1+2\right)}& \cdots & {x}_{pn}\end{array}\end{array}\right], $$
where g1 is the number of subjects in the control group and (n − g1) is the number of subjects in the disease group. In CVP, log
2
(fold-change) and -log10(p-value) from the t-test are calculated as follows.
The log
2
(fold-change) value for the i-th metabolite is
$$ {\log}_2\left({FC}_i\right)={\log}_2\left(\frac{{\overline{X}}_i^D}{{\overline{X}}_i^C}\right), $$
(1)
where \( {\overline{X}}_i^D \) represents the average intensity of the i-th metabolite for the disease group and \( {\overline{X}}_i^C \) represents the average intensity of the i-th metabolite for the control group.
The t-statistic for testing the hypothesis that the i-th metabolite is not differential, i.e.
$$ {H}_0:\kern0.75em {\mu}_i^C={\mu}_i^D\kern1.5em \mathrm{against}\kern1.5em {H}_1:\kern1em {\mu}_i^C\ne {\mu}_i^D, $$
for σ
iC
2 = σ
iD
2 is
$$ t=\frac{{\overline{X}}_i^C-{\overline{X}}_i^D}{\sqrt{S_i^2\left(\frac{1}{g_1}+\frac{1}{n-{g}_1}\right)}} $$
(2)
where
$$ {\displaystyle \begin{array}{l}{\overline{X}}_i^C=\frac{\sum \limits_{j=1}^{g_1}{x}_{ij}}{g_1};\kern1.25em {\overline{X}}_i^D=\frac{\sum \limits_{j={g}_1+1}^n{x}_{ij}}{n-{g}_1};\kern0.5em {S}_{iC}^2=\frac{1}{g_1-1}\sum \limits_{j=1}^{g_1}{\left({x}_{ij}-{\overline{X}}_i^C\right)}^2\kern0.75em ;\kern0.5em {S}_{iD}^2=\frac{1}{n-{g}_1-1}\sum \limits_{j={g}_1+1}^n{\left({x}_{ij}-{\overline{X}}_i^D\right)}^2\\ {};{S}_i^2=\frac{\left({g}_1-1\right){S}_{iC}^2+\left(n-{g}_1-1\right){S}_{iD}^2}{n-2}.\end{array}} $$
The value from eq. (2) is compared with Student’s t-value with n − 2 degrees of freedom.
If (σ
iC
2 ≠ σ
iD
2) then the test statistic is
$$ t=\frac{{\overline{X}}_i^C-{\overline{X}}_i^D}{\sqrt{\left(\frac{S_{iC}^2}{g_1}+\frac{S_{iD}^2}{n-{g}_1}\right)}} $$
(3)
In both cases, the p-value is calculated using
$$ p-\mathrm{value}=\underset{t_{calculated}}{\overset{\infty }{\int }}f(t) dt $$
(4)
In CVP, the FC value from (1) and t-value from (2) or (3) are calculated using the classical mean and variance. Because the classical mean and variance are influenced by outliers, we propose RVP using the weighted mean and variance instead of the classical mean and variance. For the weighted mean and variance, we use the kernel weight function \( {w}_j=\exp \left\{-\frac{\lambda }{2{\left( mad\left({x}_j\right)\right)}^2}{\left({x}_{ij}- median\left({x}_j\right)\right)}^2\right\}, \) where mad is the median absolute deviation. The value of this weight function lies between zero and one, and is close to zero if the observation is far from the median and close to one if the observation is close to the median. In the weight function, the tuning parameter λ is selected using k-fold cross validation (Fig. 1 summarizes the λ selection procedure). If the dataset does not contain outliers, then the value of λ is zero and all the weights are equal to 1, so the method is the same as the classical approach. The steps for RVP are given below.
-
Step − 1. Calculate log
2
(fold change) for the i-th metabolite as \( {\log}_2\left({FC}_i\right)={\log}_2\left(\frac{{\overline{X}}_i^D}{{\overline{X}}_i^C}\right), \) where \( {\overline{X}}_i^D=\sum \limits_{j={g}_1+1}^n{w}_j{x}_{ij}/\left(n-{g}_1\right) \) represents the weighted average intensity of the i-th metabolite for the disease group and \( {\overline{X}}_i^C=\sum \limits_{j=1}^{g_1}{w}_j{x}_{ij}/{g}_1 \) represents the weighted average intensity of the i-th metabolite for the control group.
-
Step − 2. Using the weighted average and weighted variance instead of the classical mean and variance, calculate -log10(p-value) for the i-th metabolite from the t-test using eqs. (2), (3) and (4), where \( {S}_{iC}^2=\sum \limits_{j=1}^{g_1}{w}_j{\left({x}_{ij}-{\overline{X}}_i^C\right)}^2/\left({g}_1-1\right) \) and \( {S}_{iD}^2=\sum \limits_{j={g}_1+1}^n{w}_j{\left({x}_{ij}-{\overline{X}}_i^D\right)}^2/\left(n-{g}_1-1\right). \)
-
Step − 3. Draw a scatter plot with log2 (fold-change) on the X-axis and -log10 (p-value) from the t-test on the Y-axis. This plot is considered to be an outlier-robust volcano plot (RVP). A metabolite is said to be differential if p-value < 0.05 and | log2 (fold-change) | > 1.
The R package of the proposed method with its user manual is available at https://github.com/nishithkumarpaul/Rvolcano.
Any user can install the “Rvolcano” package in R platform from the GitHub using the following code
library(devtools)
install_github("Rvolcano","nishithkumarpaul")
library(Rvolcano)
To draw the robust volcano plot using the package, the user manual is available at GitHub website.
Dataset description
In this paper, we use an artificially generated dataset and an experimentally measured metabolomics dataset to evaluate the performance of the proposed method in comparison with nine other methods.
Artificial data
In this study, as in [6], we generate an artificial metabolomics dataset using a one-way ANOVA model y
ijk
= μ
i
+ g
ij
+ ∈
ijk
, where y
ijk
is the intensity of the ith metabolite, jth group and kth sample, μ
i
denotes the overall intensity of metabolite i, g
ij
is the jth group effect for the ith metabolite, and ∈
ijk
is a random error term. In this linear model, μ
i
~ uniform (10, 20) and ∈
ijk
~ N(0,1). The disease and control group effects for increased concentrations of metabolites are g
ij
= N(4, 1) and g
ij
= N(2, 1), respectively; for decreased concentrations of metabolites, we use g
ij
= N(2, 1) and g
ij
= N(4, 1) for the disease and control groups, respectively. Both group effects for non-differential (equal concentration) metabolites are g
ij
= N(0, 1). To create the artificial metabolomics dataset, we designated 130 metabolites as non-differential and 20 metabolites as differential (having differential concentrations). The dataset contained 70 subjects with 40 subjects in group-1 and 30 subjects in group-2. To investigate the performance of the proposed method under different conditions, outliers were randomly distributed in the artificially generated data matrix at different rates (5%, 10%, 15%, 20%, and 25%). Note that these outliers can fall anywhere in the data matrix. The outliers for the i-th metabolite were taken from a normal distribution with mean 3*μ
i
and variance \( {\sigma}_i^2 \), i.e. N (3*μ
i
, \( {\sigma}_i^2 \)), where μi and \( {\sigma}_i^2 \) are the mean and variance of the i-th metabolite. In total, 500 artificial datasets were generated for each condition, and the performance of the proposed method was evaluated using these datasets.
Experimentally measured data
In this paper, we considered a well-known publicly available metabolomics dataset for breast cancer serum data and control serum data containing metabolite abundance level measurements from different subjects. This dataset is available from the National Institute of Health (NIH) data repository and was collected by the University of Hawaii Cancer Center under study ID ST000356. The data were measured using a gas chromatography with time-of-flight mass spectrometry (GC-TOFMS) instrument and quantified using the ChromaTOF software (v4.33, Leco Co, CA, USA). The dataset contains 134 subjects (103 breast cancer without any treatment and 31 control subjects) and 101 metabolites. Auto-scaling was used to reduce the systematic variation in the dataset. To investigate the performance of the proposed method under different conditions, we also modified the dataset by changing 5%, 10%, and 15% of the real values by N(4 × μ
i
, σ
i
2), where μ
i
and σ
i
2 are the mean and variance of the i-th metabolite in the breast cancer data matrix.