 Methodology Article
 Open Access
 Published:
SDA: a semiparametric differential abundance analysis method for metabolomics and proteomics data
BMC Bioinformatics volume 20, Article number: 501 (2019)
Abstract
Background
Identifying differentially abundant features between different experimental groups is a common goal for many metabolomics and proteomics studies. However, analyzing data from mass spectrometry (MS) is difficult because the data may not be normally distributed and there is often a large fraction of zero values. Although several statistical methods have been proposed, they either require the data normality assumption or are inefficient.
Results
We propose a new semiparametric differential abundance analysis (SDA) method for metabolomics and proteomics data from MS. The method considers a twopart model, a logistic regression for the zero proportion and a semiparametric loglinear model for the possibly nonnormally distributed nonzero values, to characterize data from each feature. A kernelsmoothed likelihood method is developed to estimate model coefficients and a likelihood ratio test is constructed for differential abundant analysis. The method has been implemented into an R package, SDAMS, which is available at https://www.bioconductor.org/packages/release/bioc/html/SDAMS.html.
Conclusion
By introducing the twopart semiparametric model, SDA is able to handle both nonnormally distributed data and large fraction of zero values in a MS dataset. It also allows for adjustment of covariates. Simulations and real data analyses demonstrate that SDA outperforms existing methods.
Background
Mass spectrometry (MS) has been widely used to profile abundances of metabolomic or proteomic features in biological samples [1]. A common goal of many MSbased studies is to identify features [2, 3] that have different abundances under different experimental groups. For example, in a lung cancer exosomal lipids dataset generated from the Resource Center for Stable IsotopeResolved Metabolomics at the University of Kentucky, a total of 39 latestage lung cancer and 27 normal samples were analyzed using Fouriertransform mass spectrometry. The abundances of 282 lipid features were measured. One goal of the study is to identify lipid features that were differentially abundant between lung cancer and normal samples.
The MS data sets often contain a large fraction of zero values [4, 5]. For example, in the aforementioned lung cancer exosomal lipid dataset, 40.1% of the observed values were zeros. The distribution of zero value proportion across metabolomic features is presented in Fig. 1a. These zero values indicate the absence or below the detection limit of certain metabolites in certain samples. The existence of these zero values complicates data analysis. Firstly, simply ignoring them would lead to biased results [6, 7]. Secondly, as the data comprise a mixture of a point mass at zero intensity and a distribution of nonzero values, standard statistical methods, such as the twosample ttest, are inappropriate. To better characterize the data, twopart models, which use one model to quantify the zero proportion and the other model to characterize the nonzero values, have been proposed. Lachenbruch [7] and Taylor and Pullard [8] presented several twopart tests, including the twopart t, twopart Wilcoxon and twopart empirical likelihood ratio tests.
Another challenge with the MS data is that the (logtransformed) nonzero values are often nonnormally distributed. We applied the ShapiroWilk test of normality to each metabolite with at least 20 nonzero values in the lung cancer exosomal lipid dataset. Figure 1b shows the distribution of resulting pvalues. More than 8% of the pvalues were less than 0.01, strongly indicating that the abundance data were not normally distributed for at least a substantial number of metabolites. Therefore, differential abundance analysis methods that fit a normal model for the nonzero values of each metabolite, e.g. a twopart ttest [7, 8], are inappropriate and may yield unreliable pvalues for those nonnormally distributed metabolites. As a result, the selection of differentially abundant metabolites is also biased as it is based on the rankings of those suspicious pvalues that do not compare the significance of different metabolites in a fair and robust manner. Nonparametric methods, such as the twopart Wilcoxon test [7, 8] and empirical likelihood ratio test [8], have also been proposed. However, the tests themselves do not provide a clear quantification of the effect size, do not allow for adjustment of covariates, and may be inefficient.
In this paper, we propose a new semiparametric differential abundance analysis (SDA) method for proteomics and metabolomics data from mass spectrometry. Our method considers a twopart semiparametric model to address the issues mentioned above. For the zero part, we consider a logistic regression model which is asymptotically equivalent to the chisquared test when there is only one categorical experimental factor. For the nonzero part, we consider a semiparametric loglinear model, which assumes a linear effect of experimental factors on the logtransformed feature abundance but allows an arbitrary distribution for the random error term. The semiparametric loglinear model has been introduced for survival data, where it is called the semiparametric accelerated failure time (AFT) model [9]. To our knowledge, this is the first time this model has been utilized for proteomics and metabolomics data, where it is especially attractive because of the ability to handle nonnormally distributed data and the direct scientific interpretation of model parameters. In addition, we propose a kernelsmoothed likelihood method to estimate regression coefficients and construct a likelihood ratio test for differential abundant analysis. We evaluate the performance of our method using simulation studies and real data analyses.
Methods
Our goal is to identify metabolomic or proteomic features that are differentially abundant between experimental groups. As we described in the previous section, MS data comprise a mixture of zero intensity values and possibly nonnormally distributed nonzero intensity values. Therefore, the differential abundance analysis needs to be performed to compare both the zero proportion and the mean of nonzero values between groups. To accomplish this, we propose SDA, which considers a twopart semiparametric model that uses a logistic regression model to characterize the zero proportion and a semiparametric loglinear model to characterize possibly nonnormally distributed nonzero values.
A twopart semiparametric model
Let Y_{ig} be the random variable representing the observed abundance of feature g in subject i (i=1,2,...,N). The distribution for Y_{ig} consists a point mass at zero and a continuous distribution on positive values. We begin by introducing a logistic regression model for the zero part. Let π_{ig}=Pr(Y_{ig}=0) be the point mass. We consider
where X_{i}=(X_{i1},X_{i2},...,X_{iQ})^{T} is a Qvector of covariates for subject i. The corresponding Qvector of model parameters γ_{g}=(γ_{1g},γ_{2g},...,γ_{Qg})^{T} quantify covariates’ effects on the fraction of zero values for feature g and γ_{0g} is the intercept.
For the continuous nonzero part, i.e. Y_{ig}>0, we consider a semiparametric model:
The model parameters β_{g}=(β_{1g},β_{2g},...,β_{Qg})^{T} have a direct and clear scientific interpretation, i.e. β_{qg} is the log fold change in observed nonzero abundance comparing different values of the qth covariate for feature g. The ε_{ig}’s (i=1,2,..N) are independent error terms with a common but completely unspecified density function f_{g}. Importantly, we do not impose any distributional assumption on f_{g}. Therefore, our semiparametric model only specifies a linear effect of covariates, but allows the error term to be arbitrarily distributed. If we further assume ε_{ig} following a normal distribution, this model reduces to a regular linear regression model on log(Y_{ig}). However, without assuming a specific parametric distribution for ε_{ig}, our model is much more flexible to characterize data with unknown and possibly nonnormal distribution.
Estimation of model parameters
We propose a likelihoodbased approach to estimate model parameters. The likelihood function for the two models jointly is:
where δ_{ig}=I{Y_{ig}=0} is an indicator function of zero value. Directly calculating the maximum likelihood estimate from this model is intractable because the likelihood involves an infinitedimensional nuisance parameter f_{g}, which is a common challenge for semiparametric model inference. A popular approach to overcome this challenge is the nonparametric maximum likelihood (NPML) method [10]. The NPML method restricts the cumulative distribution function of the error term to be a step function and therefore reduces the parameters in the likelihood to finitedimensional. Then a profile likelihood for the parameters of interest is calculated and the NPML estimate of the parameters of interest is obtained by maximizing the profile likelihood. This approach, however, is infeasible for the semiparametric model considered here because the resulting profile likelihood depends on the ranks of log(Y_{ig})−β_{g}X_{i} and is very nonsmooth so that the maximization point of it is unattainable [11].
To address this problem, we replace ε_{ig}’s density function f_{g}(x) by its kernel density estimator \(1/(Nh)\sum _{j=1}^{n}K\{(\text {log}(Y_{j})\boldsymbol {\beta }_{g} \boldsymbol {X}_{j} x)/h\}\), where K(·) is a one dimensional kernel function, such as the Gaussian kernel, with bandwidth h. Thus, we obtain the following kernelsmoothed approximation of the likelihood in Eq. (1):
This kernelsmoothed likelihood includes only a finite number of model parameters. Importantly, this function is very smooth in (γ_{0g},γ_{g},β_{g}), and thus the maximum likelihood estimator, (\(\hat {\gamma }_{0g}, \hat {\boldsymbol {\gamma }}_{g}, \hat {\boldsymbol {\beta }}_{g}\)), can be easily obtained through a trust region maximization algorithm or other NewtonRaphson gradientbased search algorithm [11–14].
Identification of differentially abundant features
Hypothesis testing on the effect of the qth covariate on the gth feature is performed by assessing γ_{qg} and β_{qg}. Consider the null hypothesis H_{0}:γ_{qg}=0 and β_{qg}=0 against alternative hypothesis H_{1}: at least one of the two parameters is nonzero. We propose a likelihood ratio test (LRT) to test the hypothesis. The test statistic is:
where (\(\tilde {\gamma }_{0g}, \tilde {\boldsymbol {\gamma }}_{g}, \tilde {\boldsymbol {\beta }}_{g}\)) is the maximization point of the likelihood under H_{0}. The pvalue is calculated based on a chisquare distribution with 2 degrees of freedom. To adjust for multiple comparisons across features, the false discovery rate (FDR) qvalue [15] is calculated based on the qvalue function in the qvalue package in R/Bioconductor.
Results
Simulation studies
We performed comprehensive simulation studies to evaluate the performance of SDA and to compare with three existing methods described in Taylor and Pollard [8]: twopart t test (2T), twopart Wilcoxon test (2W) and empirical likelihood ratio test (ELRT). Because Taylor and Pollard [8] did not provide a method for multiple comparison adjustment for these three methods, we considered the same FDR adjustment method [15] used in SDA to make methods more comparable.
We focused on the twogroup comparison problem and considered two simulation scenarios. For the first scenario, data were simulated based on a prostate cancer proteomics data from the human urinary proteome database [16]. A detailed description of this dataset is provided in the “Real data analyses” section. Each simulated dataset contains 2n subjects and 4,000 features. For each feature, the n observations of group 1 were generated based on a mixture distribution \(p H(x) + (1p) \hat {F}(x)\), where the zero proportion p was generated from Uniform(0,0.8),H(x) was the unit step function, and \(\hat {F}(x)\) was the empirical distribution (in the log scale) of a randomly selected feature that had at least 20 nonzero values in the control group of the proteomics data. For a nondifferentially abundant feature, the n observations of group 2 were generated from the same distribution as of group 1. For a differentially abundant feature, a 2fold difference (β= log(2)), which was also used in one of the simulation studies in [6], was added to the nonzero part of the distribution.
In our simulations, we set n to 50 or 100 and considered 5%, 10% or 20% differentially abundant features. In this section, we only present results from simulations with 10% differentially abundant features. Similar results were obtained for 5% or 20% differentially abundant features (see Additional file 1). For the proposed method, we chose the Gaussian kernel for K(·) which is commonly used in kernel density estimation. For the smoothing parameter h, we used the optimal bandwidth \(h=1.144\hat {\sigma }N^{1/5}\) [17], where N=2n is the total sample size, and \(\hat {\sigma }\) is the sample standard deviation of {log(Y_{ig}),i=1,...N}.
We first compared the performance of different methods in terms of ranking features. Figure 2 shows the true positive rate (TPR) against the number of topranked features based on pvalues for each method. The left column shows results from all features, including both normally and nonnormally distributed. SDA had a higher TPR than all other methods, and the difference increased with sample size. Twopart t and twopart Wilcoxon tests had very similar TPRs, while ELRT had a much lower TPR. The right column shows results from nonnormally distributed features (ShapiroWilk test pvalue <0.01 for at least one of the two groups). Similar to the left column, SDA had the highest TPR, demonstrating its ability to model nonnormally distributed data. The twopart ttest had a lower TPR than the twopart Wilcoxon test as the data normality assumption of the twopart t test was violated for those features.
To further quantify the overall performance of different methods, we calculated the area under the ROC curve (AUC). As shown in Table 1, SDA had the highest AUC values under all scenarios, especially when evaluating on nonnormally distributed features only. The AUCs from twopart Wilcoxon and twopart t tests were close to each other when evaluating on all features, and twopart Wilcoxon had a slightly better AUC when evaluating on nonnormally distributed features only. ELRT had the worst AUCs in all scenarios.
We next assess the accuracy in estimating the FDR for different methods. Figure 3 displays the reported FDR against true FDR. The reported FDR based on SDA and twopart ttest were close to the true FDR, indicating that those methods were able to accurately estimate the FDR. The reported FDR based on the twopart Wilcoxon test was smaller than the true FDR under all scenarios, suggesting that it was conservative in detecting differentially abundant features. The reported FDR based on ELRT was close to the true FDR when n=50, but went larger than the true FDR when n increased to 100.
Figure 4 plots the number of discoveries against a given FDR threshold, which was set to 0.05, 0.1, or, 0.2. For each scenario, we present the total discoveries as well as the false discoveries (shaded area). The SDA method identified more truly differentially abundant features than all other methods at any given threshold.
For the second simulation scenario, data were simulated following the same procedure as the first simulation scenario, but with one additional step of censoring by a detection limit. Specifically, the detection limit for a feature was chosen as the 10th percentile of the simulated nonzero values from the two groups combined. All nonzero values below the detection limit were set to zero to mimic the situation that a fraction of observed zero values were due to detection limit. Data simulated under this scenario had different numbers of zeros between groups for differentially abundant features because the group with lower abundance level of a feature had more values that fell below the detection limit. The results were presented in Figures S715 in Additional file 1. Similar to the first simulation scenario, SDA had a higher true positive rate compared to other methods under this simulation scenario. SDA also identified more truly differentially abundant features than all other methods at any given FDR threshold for nonnormally distributed features.
Real data analyses
Prostate cancer proteomics data
We applied our method to prostate cancer data from the human urinary proteome database [16]. In our analysis, we compared proteomic feature abundances between 526 prostate cancer and 1503 healthy subjects. A total of 5605 proteomic features were measured for each subject, where the abundance measurement had been normalized relative to 29 urinary “housekeeping” peptides to adjust for analytical and urine dilution variances [16, 18, 19]. Figure 5 presents results on analyzing the whole dataset with an FDR threshold of 0.05. The majority of differentially abundant features identified by different methods overlapped, having 3043 features in common. We next evaluated the performance of different methods under smaller sample size, where we subsampled 10% or 20% of the data and calculated the concordance on identified differentially abundant features between the sub and whole datasets. Specifically, we focused on the 3043 features that were commonly identified by all methods from the whole dataset and investigated what fraction of these features could also be identified by each method when analyzing the subdataset. Figure 6 plots the number of discoveries under FDR threshold of 0.05, 0.1 or 0.2. Compared to other methods, SDA based on a subdataset were able to identified a larger number of the 3043 differentially abundant features obtained from the whole dataset, and therefore provided a better concordance between the sub and whole dataset analysis.
Lung cancer exosomal lipids data
We applied our method to the lung cancer exosomal lipids dataset described in the “Background” section. The data acquisition and normalization procedure of this dataset is provided in Additional file 2. Table 2 shows differentially abundant features identified by SDA, twopart t, twopart Wilcoxon, and ELRT tests for the comparison between late stage lung cancer and normal samples. SDA identified a total of 15 differentially abundant features, including all 6 features identified by any of the other three methods and 9 additional features. These features were further characterized by tandem MS, which showed that several ions comprise more than one isobaric species which could be assigned to specific lipids (see Table S1 in Additional file 2). The lipids were dominated by triglycerides, which are typically storage lipids and associated with lung cancer risk based on cohort studies [20, 21]. Some of the acyl chains were long chain (>16) and polyunsaturated, which can be hydrolyzed to bioactive lipids (diacylglycerols and the fatty acids). Also found was a sphingomyelin, which can be important cell signaling regulators [22] with key roles in lung cancer pathogenesis [23].
Discussion
In standard statistical practice, examining data normality is usually the first step of data analysis. If the data is normally distributed, parametric methods, e.g. ttest, will be used. Otherwise, nonparametric tests, e.g. Wilcoxon ranksum test, will be considered. However, for metabolomics and proteomics data with a large number of features, it is more difficult to examine data normality for each of the features, but the choice of an appropriate statistical method depends on it. SDA solves this problem by introducing a unified semiparametric model for both normally and nonnormally distributed data, and therefore providing valid inference without having to check for data normality. SDA possesses merits of both nonparametric and parametric methods. On one hand, it is free of the data normality assumption. On the other hand, it allows quantification of the effect size and adjustment of covariates.
SDA is robust to the choice of bandwidth for moderate to large sample size. But when the sample size is small, choice of bandwidth may have an impact. We evaluated the bandwidth proposed by [17] as well as five other bandwidths described in [11] using simulation studies. We found that the bandwidth \(h=1.144\hat {\sigma }N^{1/5}\) [17] yielded the best performance (data not shown). Therefore, this bandwidth was used in our analysis.
The observed zero values may be a mixture of zeros due to the absence of a compound and values below the detection limit. To deal with values below the detection limit, one frequently used approach is data imputation [24]. However, for MS data, it is unknown whether an observed zero value is due to the absence of a compound or below the detection limit. Data imputation can only be performed on all the observed zero values, which would lead to biased results because zero values due to the absence of a compound would also be imputed with positive values. In fact, it is difficult to distinguish these two types of zeros in statistical inference without imposing additional parametric model assumptions. Therefore, our method, as well as the twopart ttest and twopart Wilcoxon test, focuses on assessing the null hypothesis that the distribution of observed abundance level is the same between groups, i.e. the proportion of observed zero values (including both the absence of a compound and below the detection limit) and the distribution of observed nonzero values (values above the detection limit) are the same between groups. Our alternative hypothesis is that the proportion of observed zero values and/or the distribution of observed nonzero values are different between groups.
For the case of twogroup comparison in presence of detection limit, our test is also a valid test (in terms of preserving the type I error rate) for assessing the null hypothesis that the distribution of underlying abundance level without censoring by the detection limit is the same between groups, i.e. the proportion of zero underlying abundance values and the distribution of nonzero underlying abundance values are the same between groups (see Proposition S1 in Additional file 3). To numerically validate this, we performed a singlefeature simulation study, which showed that our test preserved the type I error rate around 5% (see Table S2 in Additional file 1). Furthermore, as demonstrated by the second simulation scenario in the “Simulation studies” section, our method outperformed other methods in identifying differentially abundant features, especially nonnormally distributed features, under such situation.
This paper focuses on downstream differential abundance analysis of MS data, expecting that the data have already been appropriately processed and normalized. In fact, data normalization is a critical step in MS data processing to adjust size effect, due to the difference in the sample amount or dilution across samples, as well as other technical variations. Various data normalization methods, such as housekeeping normalization [18, 25, 26], centred logratio transformation [25], probabilistic quotient normalization [25, 27], total sum normalization [25], and variance stabilization normalization [27, 28], have been proposed. The choice of an appropriate normalization method depends on the type of biological samples, the study design, and the investigator’s experience. It has been shown that data normalization can substantially affect downstream analysis [25, 28, 29]. Therefore, we highly suggest users to carefully perform data normalization prior to differential abundance analysis.
We consider the case that individual observations are independent of each other in this paper. One of our future directions is to extend SDA to paired data, e.g. comparing metabolomic profiles between paired tumor and normal samples from the same patient. To deal with the correlation between paired samples, we can introduce random effect terms in both the logistic regression and the semiparametric loglinear models. However, the computation of kernelsmoothed likelihood is more complicated.
Conclusion
In this paper, we propose a new differential abundance analysis method, SDA, for metabolomic and proteomic data generated from MS. Based on a twopart semiparametric model, the SDA method is able to robustly handle nonnormally distributed data and to adjust for covariates. Meanwhile, our model provides a direct quantification of the effect size. We develop a kernelsmoothed likelihood procedure for model parameter estimation and a likelihood ratio test for differential abundance analysis. Simulation studies and analyses of proteomics and metabolomics datasets show that SDA outperforms existing methods.
Availability of data and materials
An R package, SDAMS, that implements the proposed method is available at https://www.bioconductor.org/packages/release/bioc/html/SDAMS.html. The code for performing simulation studies and reproducing figures/tables is available at http://sweb.uky.edu/~cwa236/SDA.html.
Abbreviations
 2T:

twopart t test
 2W:

twopart Wilcoxon test
 AFT:

Accelerated failure time
 AUC:

Area under the ROC curve
 ELRT:

Empirical likelihood ratio test
 FDR:

False discovery rate
 LRT:

Likelihood ratio test
 MS:

Mass spectrometry
 NPML:

Nonparametric maximum likelihood
 ROC:

Receiver operator characteristic
 SDA:

Semiparametric differential abundance analysis
 TPR:

True positive rate
References
 1
Want EJ, Cravatt BF, Siuzdak G. The expanding role of mass spectrometry in metabolite profiling and characterization. Chembiochem. 2005; 6(11):1941–51.
 2
Cottrell JS. Protein identification using ms/ms data. J Proteome. 2011; 74(10):1842–51.
 3
Xi B, Gu H, Baniasadi H, Raftery D. Statistical analysis and modeling of mass spectrometrybased metabolomics data. In: Mass Spectrom Metabolomics. New York: Humana Press: 2014. p. 333–53.
 4
Nie L, Wu G, Brockman FJ, Zhang W. Integrated analysis of transcriptomic and proteomic data of desulfovibrio vulgaris: zeroinflated poisson regression models to predict abundance of undetected proteins. Bioinformatics. 2006; 22(13):1641–7.
 5
Lazar C, Gatto L, Ferro M, Bruley C, Burger T. Accounting for the multiple natures of missing values in labelfree quantitative proteomics data sets to compare imputation strategies. J Proteome Res. 2016; 15(4):1116–25.
 6
Gleiss A, Dakna M, Mischak H, Heinze G. Twogroup comparisons of zeroinflated intensity values: the choice of test statistic matters. Bioinformatics. 2015; 31(14):2310–7.
 7
Lachenbruch PA. Comparisons of twopart models with competitors. Stat Med. 2001; 20(8):1215–34.
 8
Taylor S, Pollard K. Hypothesis tests for pointmass mixture data with application toomics data with many zero values. Stat Appl Genet Mol Biol. 2009; 8(1):1–43.
 9
Kalbfleisch JD, Prentice RL, Vol. 360. The statistical analysis of failure time data. Hoboken: Wiley; 2002.
 10
Groeneboom P, Wellner JA. Information Bounds and Nonparametric Maximum Likelihood Estimation, vol. 19. Basel: Birkhauser Verlag; 1992.
 11
Zeng D, Lin D. Efficient estimation for the accelerated failure time model. J Am Stat Assoc. 2007; 102(480):1387–96.
 12
Ionides E. Maximum smoothed likelihood estimation. Stat Sin. 2005; 15(4):1003–14.
 13
Groeneboom P, Jongbloed G, Witte BI, et al.Maximum smoothed likelihood estimation and smoothed maximum likelihood estimation in the current status model. Ann Stat. 2010; 38(1):352–87.
 14
Groeneboom P, et al.Maximum smoothed likelihood estimators for the interval censoring model. Ann Stat. 2014; 42(5):2092–137.
 15
Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci. 2003; 100(16):9440–5.
 16
Siwy J, Mullen W, Golovko I, Franke J, Zürbig P. Human urinary peptide database for multiple disease biomarker discovery. PROTEOMICSClin Appl. 2011; 5(56):367–74.
 17
Sheather SJ, et al. Density estimation. Stat Sci. 2004; 19(4):588–97.
 18
JantosSiwy J, Schiffer E, Brand K, Schumann G, Rossing K, Delles C, Mischak H, Metzger J. Quantitative urinary proteome analysis for biomarker evaluation in chronic kidney disease. J Proteome Res. 2008; 8(1):268–81.
 19
Good DM, Zürbig P., Argiles A, Bauer HW, Behrens G, Coon JJ, Dakna M, Decramer S, Delles C, Dominiczak AF, et al.Naturally occurring human urinary peptides for use in diagnosis of chronic kidney disease. Mol Cell Proteomics. 2010; 9(11):2424–37.
 20
Lin X, Lu L, Liu L, Wei S, He Y, Chang J, Lian X. Blood lipids profile and lung cancer risk in a metaanalysis of prospective cohort studies. Journal of clinical lipidology. 2017; 11(4):1073–81.
 21
Ulmer H, Borena W, Rapp K, Klenk J, Strasak A, Diem G, Concin H, Nagel G. Serum triglyceride concentrations and cancer risk in a large cohort study in austria. Br J Cancer. 2009; 101(7):1202.
 22
Ogretmen B. Sphingolipid metabolism in cancer signalling and therapy. Nat Rev Cancer. 2018; 18(1):33.
 23
Bieberich E, Wang G. Sphingolipid in lung cancer pathogenesis and therapy. In: A Global Scientific VisionPrevention, Diagnosis, and Treatment of Lung Cancer. IntechOpen: 2017.
 24
PalareaAlbaladejo J, MartinFernandez JA. zcompositions—r package for multivariate imputation of leftcensored data under a compositional approach. Chemometr Intell Lab Syst. 2015; 143:85–96.
 25
Gardlo A, Smilde AK, Hron K, Hrda M, Karlikova R, Friedeckỳ D, Adam T. Normalization techniques for parafac modeling of urine metabolomic data. Metabolomics. 2016; 12(7):117.
 26
Wu Y, Li L. Sample normalization methods in quantitative metabolomics. J Chromatogr A. 2016; 1430:80–95.
 27
Li B, Tang J, Yang Q, Li S, Cui X, Li Y, Chen Y, Xue W, Li X, Zhu F. Noreva: normalization and evaluation of msbased metabolomics data. Nucleic Acids Res. 2017; 45(W1):162–70.
 28
Välikangas T, Suomi T, Elo LL. A systematic evaluation of normalization methods in quantitative labelfree proteomics. Brief Bioinform. 2016; 19(1):1–11.
 29
Thongboonkerd V. Practical points in urinary proteomics. J Proteome Res. 2007; 6(10):3881–90.
Acknowledgements
We thank Xiaofei Zhang for MS1 spectral binning and normalization of the lung cancer exosomal lipids data.
Funding
This work was supported by National Institutes of Health [1R03CA211835, 5P20GM10343615, 1P01CA16322301A1], the Biostatistics and Bioinformatics and Redox Metabolism Shared Resource Facilities of the University of Kentucky Markey Cancer Center [P30CA177558]. The National Institutes of Health played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Author information
Affiliations
Contributions
LC, CW and AJS designed the study. YL, CW and LC derived the method. YL implemented the method and performed simulation studies and real data analyses. TWMF, ANL, WYK, and SMA provided the lung cancer exosomal lipids data and interpreted the data analysis results, YL, ANL, WYK, AJS, CW and LC wrote the manuscript. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
Additional simulation results. This file provided additional simulation results with 5% or 20% differentially abundant features for the first simulation scenario described in the main text. We compared SDA to 2T, 2W and ELRT methods for true positive rate, FDR and number of significant features with a given FDR threshold. SDA performed better than other methods in all comparisons. This file also provided simulation results with 5%, 10% or 20% differentially abundant features for the second simulation scenario described in the main text. SDA also outperformed other methods in identifying differentially abundant features, especially nonnormally distributed features. In addition, this file provided results from a singlefeature simulation study showing that SDA preserved the type I error rate around 5% for twogroup comparison in presence of detection limit.
Additional file 2
Data acquisition procedure, data normalization and lipid assignment of differentially abundant features for the lung cancer exosomal lipid dataset.
Additional file 3
A proposition to show that for the case of twogroup comparison in presence of detection limit, our test is also a valid test (in terms of preserving the type i error rate) to assess the null hypothesis that the distribution of underlying abundance level without censoring by the detection limit is the same between groups.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Li, Y., Fan, T.W., Lane, A.N. et al. SDA: a semiparametric differential abundance analysis method for metabolomics and proteomics data. BMC Bioinformatics 20, 501 (2019). https://doi.org/10.1186/s128590193067z
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s128590193067z
Keywords
 Differential abundance analysis
 Metabolomics
 Proteomics
 Semiparametric loglinear model
 Kernel smoothing