 Research article
 Open Access
A semiparametric modeling framework for potential biomarker discovery and the development of metabonomic profiles
 Samiran Ghosh^{1}Email author,
 David F Grant^{2}Email author,
 Dipak K Dey^{3} and
 Dennis W Hill^{2}
https://doi.org/10.1186/14712105938
© Ghosh et al; licensee BioMed Central Ltd. 2008
 Received: 08 August 2007
 Accepted: 23 January 2008
 Published: 23 January 2008
Abstract
Background
The discovery of biomarkers is an important step towards the development of criteria for early diagnosis of disease status. Recently electrospray ionization (ESI) and matrix assisted laser desorption (MALDI) timeofflight (TOF) mass spectrometry have been used to identify biomarkers both in proteomics and metabonomics studies. Data sets generated from such studies are generally very large in size and thus require the use of sophisticated statistical techniques to glean useful information. Most recent attempts to process these types of data model each compound's intensity either discretely by positional (mass to charge ratio) clustering or through each compounds' own intensity distribution. Traditionally data processing steps such as noise removal, background elimination and m/z alignment, are generally carried out separately resulting in unsatisfactory propagation of signals in the final model.
Results
In the present study a novel semiparametric approach has been developed to distinguish urinary metabolic profiles in a group of traumatic patients from those of a control group consisting of normal individuals. Data sets obtained from the replicates of a single subject were used to develop a functional profile through Dirichlet mixture of beta distribution. This functional profile is flexible enough to accommodate variability of the instrument and the inherent variability of each individual, thus simultaneously addressing different sources of systematic error. To address instrument variability, all data sets were analyzed in replicate, an important issue ignored by most studies in the past. Different model comparisons were performed to select the best model for each subject. The m/z values in the window of the irregular pattern are then further recommended for possible biomarker discovery.
Conclusion
To the best of our knowledge this is the very first attempt to model the physical process behind the timeof flight mass spectrometry. Most of the state of the art techniques does not take these physical principles in consideration while modeling such data. The proposed modeling process will apply as long as the basic physical principle presented in this paper is valid. Notably we have confined our present work mostly within the modeling aspect. Nevertheless clinical validation of our recommended list of potential biomarkers will be required. Hence, we have termed our modeling approach as a "framework" for further work.
Keywords
 Posterior Distribution
 Markov Chain Monte Carlo
 Credible Interval
 Markov Chain Monte Carlo Algorithm
 High Resolution Mass Spectrometry
Background
The enormous amount of data generated by mass spectrometric analysis requires sophisticated statistical techniques to differentiate between the urinary metabolic profile of traumatized individuals and healthy individuals. A typical mass spectrum contains thousands of points (m/z values) while subject numbers are quite low. We propose a semiparametric framework to model individual mass spectra obtained from each subject over all replicates. The model or functional form thus developed is termed as subject profile. We have explored different model validation criteria in this regard. Mass spectral profiles of the combined control group and individual patients are compared through a survival relative intensity function (SRIF). A deviation or irregularity in the pattern of the patients from that of the control group indicates possible effects due to trauma. Aided with this knowledge, a mechanism based on predictive relative abundance is recommended to identify potential biomarkers associated with trauma. Though we have shown efficacy of our method for metabonomics this same approach could be easily extended for other domains, as long as the physical mechanism that generates data remains the same.
The rest of the paper is organized as follows. We first develop the semiparametric framework in the Methods section. Choice of prior and development of posterior sampling schemes are discussed next. Following that we describe predictive relative abundance criteria and construction of SRIF for biomarker discovery framework. Then we illustrates different model validation and comparison criteria. We tested our proposed methodology with actual data in the Results and Discussions section. We then provide a brief discussion and direction for future work.
Methods
0.1 Semiparametric Framework
Note that H* : ℝ^{+} → [0, 1]. Having transformed the mean intensity into the interval [0, 1] we would like to approximate ${H}^{\ast}(t)={\displaystyle {\sum}_{l=1}^{m}{\eta}_{l}IB(\tilde{H}(t);{r}_{l},{s}_{l})}$, where η_{ l }are weights satisfying ${\sum}_{l=1}^{m}{\eta}_{l}=1$ for all replicates and I B(.; r_{ l }, s_{ l }) denotes the incomplete beta function. Collecting the weights into η= (η_{1}, η_{2}, ... η_{ m }) a Dirichlet prior can be assigned. The (r_{ l }, s_{ l }) are chosen to ensure that Beta c.d.f's have equally spaced means and are centered around $\tilde{H}$(t). $\tilde{H}$(t) is a suitable function which maps the time scale into [0,1]. The related issue of choosing $\tilde{H}$(t) need to be resolved by selecting a plausible central function around which the H* function is distributed. If we consider cumulative intensity function, parallel to the cumulative hazard function of the survival analysis literature [10] then we can take $\tilde{H}(t)=\frac{{\tilde{H}}_{0}(t)}{1+{\tilde{H}}_{0}(t)}$. We would like to use Extreme value, Double exponential and Normal distribution for $\tilde{H}$_{0}(t) and perform model comparison. We consider different choices of $\tilde{H}$_{0}(t) as

Extreme value : ${f}_{0}(t)=\frac{1}{\beta}{e}^{\frac{t\alpha}{\beta}}\mathrm{exp}({e}^{\frac{t\alpha}{\beta}})$, ${\tilde{H}}_{0}(t)={e}^{\frac{t\alpha}{\beta}}$; γ = (α, β) and β > 0,

Double Exponential : ${f}_{0}(t)=\frac{1}{2\beta}{e}^{\left\frac{t\alpha}{\beta}\right}$, ${\tilde{H}}_{0}(t)=\{\begin{array}{ll}\mathrm{log}\left[1\frac{1}{2}\mathrm{exp}\left(\frac{t\alpha}{\beta}\right)\right],\hfill & t\le \alpha \hfill \\ \frac{t\alpha}{\beta}+\mathrm{log}2,\hfill & t\ge \alpha \hfill \end{array};\gamma =(\alpha ,\beta )$ and β > 0,

Normal : ${f}_{0}(t)=\frac{1}{\sigma \sqrt{2\pi}}\mathrm{exp}\left\{\frac{1}{2}{\left(\frac{t\mu}{\sigma}\right)}^{2}\right\}$, ${\tilde{H}}_{0}(t)=\mathrm{log}\left\{1\Phi \left(\frac{t\mu}{\sigma}\right)\right\}$; γ = (μ σ), σ > 0.
0.2 Likelihood Construction
An interesting departure from the regular hierarchical model setup is that the above likelihood does not integrate all the subjects at once. The same model is fitted for different subjects by integrating over all subject specific replicates to capture subject specific variations. Note that γ, the vector of parameters associated with $\tilde{H}$(t) and f (γ) is a corresponding suitably chosen prior. Clearly, likelihood is a function of mixing weights η= (η_{1}, η_{2}, ... η_{ m }). With the above data likelihood specific choice of $\tilde{H}$(t) and adjoining suitable priors to the η and γ, a complete Bayesian hierarchical model setup is completed. We further elucidate the choice of different priors and explicit steps for numerical calculation in next section.
Prior Choices and Numerical Implementation
The flexibility of our approach comes from modeling the η, which gives subjectspecific information. It is natural to model η as f (ηφ_{ η }), is the Dirichlet prior drawn from Dirichlet (φ_{ η }1), where we take a fixed scalar hyperparameter ${\varphi}_{\eta}=\frac{1}{m}$, as in our case m = 5. As pointed out by [10] this choice combined with evenly spaced beta distribution parameter, namely r_{ l }and s_{ l }will give rise to a model centered around $\tilde{H}$(t).
We fix m = 5, {r_{ l }} = (1, 2, 3, 4, 5) and {s_{ l }} = (5, 4, 3, 2, 1) which will produce five evenly spaced beta distributions. Five beta mixtures should not be mixed with the coarseness of the MS data. It should be noted that the present modeling effort is concentrated on the hazard space where five beta mixtures are used as an artifact with different choice of centering functions to capture wide range of intensity functions. A flexibility analysis of the presented model is also included in the supplementary material (see Additional file 1). Parameters associated with γ are either location or scale parameters. For the location parameter (α or μ) we have chosen a Gaussian prior with sufficiently large variance. Since the observed relative intensity even in the log scale does not go below 2 (minimum resolution is 100 daltons in the mass spectrometer), some truncation is necessary to achieve this goal. We have used truncated normal distribution (truncated any value that goes below 0). From a practical point of view this decision makes MCMC sampling scheme to converge much faster while used in conjunction with normal proposal. For scale parameter (β or σ^{2}) initially an inverse gamma prior is assigned. Though this works well for Extreme value distribution, for other two cases we found that MCMC chain is very slow from jumping one state to another. Thus for those cases, a log transformation is used for the scale parameter which also helps to symmetrize the posterior distribution. A normal proposal is used for this transformed case. Nevertheless the posterior distribution is always complex and implementation of this Bayesian procedure requires MCMC sampling scheme. In general it is difficult to show logconcavity of the conditional posterior distribution of η in its each component. Keeping this in mind and for the computational flexibility we would make a transformation for η. Since the support of each η_{ l }is [0, 1] we make change of variable for (η_{1}, η_{2}, η_{3}, η_{4}) to (θ_{1}, θ_{2}, θ_{3}, θ_{4}), where ${\theta}_{l}=\mathrm{log}\left(\frac{{\eta}_{l}}{1{\eta}_{l}}\right)$ for l = 1, ..., 4. Note that we need to specify only four of the η_{ l }'s as ${\eta}_{5}=1{\displaystyle {\sum}_{l=1}^{4}{\eta}_{l}}$. The inverse transformation is ${\eta}_{l}=\frac{\mathrm{exp}({\theta}_{l})}{1+\mathrm{exp}({\theta}_{l})}$, with Jacobian $\frac{\mathrm{exp}({\theta}_{l})}{{[1+\mathrm{exp}({\theta}_{l})]}^{2}}$. The above transformation makes it easier to specify the proposal density, although one can construct Metropolis chain for the untransformed parameter. The proposal distribution is normal ${N}_{4}({\theta}_{l}^{(t1)},\widehat{\Sigma})$, where initial estimates are obtained through NelderMead downhill simplex algorithm. To accelerate convergence from the output of the first algorithm we got a new estimate of $\widehat{\Sigma}$ as $\frac{1}{G}{\displaystyle {\sum}_{g=1}^{G}({\theta}_{j}\overline{\theta})({\theta}_{j}\overline{\theta})}$, where g indexes Monte Carlo samples. The sampling scheme implemented is a variation of the grouped Gibbs sampler [11, 12], which requires drawing:
 1.
θ~ π (θγ, $D={\displaystyle {\cup}_{i=1}^{r}{D}^{i}}$), with acceptance probability $\alpha =\mathrm{min}\left\{1,\frac{\pi (\theta \gamma ,D){N}_{4}({\theta}_{0},\widehat{\Sigma})}{\pi ({\theta}_{0}\gamma ,D){N}_{4}(\theta ,\widehat{\Sigma})}\right\}$ and then transform back θ↦ η,
 2.
μ ~ π (μη, σ p, D),
 3.
σ ~ π (μη, μ, p, D).
Since full conditionals are not in conventional form we approximate them by sampling with a Metropolis step within the Gibbs sampler. Under this setup η and γ are updated sequentially and monitored for convergence following post burnin period. We would like to mention at this point that logtransformation of the observed m/z values are done for greater numerical stability of the MCMC algorithm only.
Predictive Calculation
We have a two folded objective namely subject diagnostics and potential biomarker identification. Notably subject diagnostic is a global summarization while biomarker (m/z values) identifications are more localized in nature. For this we proposed two criteria namely Survival Relative Intensity Function (SRIF) and Predictive Relative Abundance (PRA). SRIF is calculated at low resolution, the very essence being global estimation of subject profile (a representative curve describing overall rough shape) for general subject diagnostic. PRA is directed to identify potential biomarkers associated with anomalous m/z values at much finer resolution. This two step breakdown of estimation at different resolution is essential for successful information retrieval from highly irregular MS data.
0.3 Predictive Relative Abundance
To get an estimated value of h(t) through equation (1) the only requirement is the differentiability of $\tilde{H}$(t), which in turn depends upon the differentiability of $\tilde{H}$_{0}(t). Hence once the parameter in Θ is estimated through MCMC, we can easily calculate $\widehat{h(t)}$. Under the assumption that control group is homogeneous, posterior distribution for each model parameter should have similar distribution across all controls. To illustrate more, consider a single parameter only (say α). We obtain posterior samples from all control subjects separately and then mix them together under the hypothesis that posterior distribution of α is same for all controls. This gives us resultant posterior distribution combining all controls for α. This pooled sample is then used to construct ${\widehat{\delta}}_{t}^{i}$, and its 95% credible interval. This will give us a upper and lower bound within which an observed relative abundance (d_{ t }) at any m/z value should be lying for any normal subject. We hypothesize that deviation from the above will indicate possible effects of trauma.
0.4 ProductLimit type Estimate of SRIF
Since ${\widehat{\delta}}_{t}^{i}$ is a function of Θ, so does S(t). Once the overall pulled estimates of the parameters of Θ based on the control group is obtained, we would like to treat S(t) thus obtained from the control group as the representative surviving relative intensity function (SRIF) of a normal individual being. Similar to the relative abundance case we would like to draw 95% credible interval for the ideal SRIF obtained from the control group. For a patient, we would like to compare this ideal SRIF band with that of the individually observed SRIF obtained through equation (6).
As mentioned earlier, SRIF will give us an overall visual idea about an individual's wellbeing. Depending upon this initial health identification status, we will recommend further investigation through predictive relative abundance bound described earlier (subsection 0.3), for identifying abnormal relative abundance associated with specific individual m/z value. Thus identified m/z values are further recommended for compound level identification as potential biomarker associated with traumatic disorder. Some additional discussion regrading the nature and computation of SRIF and PRA is also provided in the supplementary material (see Additional file 1).
Different Model Comparison Criteria
In the present context we have entertained three different model choices, which necessitates model comparison for getting the best fitted model. We have explored both Bayesian as well as some frequentist model selection criteria. While some of them are readily applicable, some requires modification to be meaningful for mass spectrometry data set. We have used Conditional Predictive Ordinate (CPO), Bayes Factor, Bayesian Information Criterion and a variant of Mean Square Based Criterion for model assessment. To check predictive accuracy, a new measure namely Conditional Predictive Intensity/Hazard Function (CPIF) has been developed. We have also discussed in details different estimation strategy involving Monte Carlo approximation of the above measures. For the sake of space statistical justification and calculation behind all of these criteria are placed in the supplementary material (see Additional file 1). We would like to note that DIC is not being explored in this paper as it is under severe scrutiny for its lack of interpretation in different scenario. However its implementation is not difficult and can be easily done if required.
Results and Discussion
0.5 Data Description: Trauma Data Set
Acute trauma with associated hemorrhage, shock and sepsis, often produces a generalized inflammatory response resulting in organ dysfunction and organ failure. Acute trauma is associated with high mortality rates and is among the leading causes of death in Americans between the ages of 1 and 44 [13]. In this paper we explore the feasibility of using a high resolution mass spectrometry to identify and quantify metabolites in the urine of acute trauma patients and compare these with control urine samples. The rationale behind this approach is that such data would provide a "metabonomic profile" that could be used for realtime analysis of acute trauma patient status. It should be noted that trauma is not a disease and is very much heterogenous in nature. Each trauma patient likely to be unique. This points out that though normal subjects can be combined to develop a standard "metabonomic profile", patients may not necessarily be combined unless we can create some homogenous traumatic disorder in laboratory controlled environment. Our goal is to determine whether a patient profile is anomalous in comparison to standard metabonomic profile and then detect patient specific signals for individual trauma patients. With these goals in mind we have retrospectively analyzed urine samples collected from 6 normal healthy individuals and 6 patients with different degree of acute trauma. Each urine sample was analyzed in triplicate (r = 3). Every patient is coded as "PDG", while for every control it is "CDG". Creatinine concentrations were measured [14] in each urine sample and each sample was analyzed by electrospray ionization on a Micromass QTOF2 mass spectrometer using lisinopril as a calibration and quantitative internal standard. Mass spectra were collected in positive ionization mode from 100 to 1600 daltons. Mass spectral data was collected in the continuum mode and processed by MassLynx (Micromass) software to generate centroid spectral data consisting of accurate m/z values and ion intensities. The data used in the present study were generated by eliminating all ions that were less than or equal to three times the instrument noise and dividing the intensity of each ion by the concentration of creatinine in the respective urine sample. The normalization of the ion intensities to the creatinine concentration was necessary to compensate for urine dilution in different individuals. Ion resolution in this system was between 9200 and 9800 (Δm_{1/2}/m) and m/z reproducibility was less than +/ 20 ppm. The m/z derived from a specific compound may therefore vary by the reproducibility of the method if the ion is resolved from the ions produced by other compounds. As described in section 0.1, we have entertained three different model choices; Model 1: $\tilde{H}$_{0}(t) corresponds to Extreme value, Model 2: $\tilde{H}$_{0}(t) corresponds to Double Exponential and Model 3: $\tilde{H}$_{0}(t) corresponds to Normal distribution. All of these models have seven parameters, which need to be estimated through MCMC. The first 5,000 iterations were thrown out as burnin period, then an additional 50,000 iterations were obtained out of which we accepted only every 50th iteration, creating 1000 samples from the posterior distribution. This was used to compute all necessary statistics namely posterior mean, 95% credible interval and other measure related to model comparison. Every MCMC chains were tested for convergence by Geweke's [15] convergence diagnostic and autocorrelation plot through R software. Several combinations of hyperparameter values were tried. We report the results for prior distribution of the location parameter as N (2, 10) and for scale parameter as IG (2, 10), so that priors do not drive the inference.
0.6 Model Comparison Results
Different model parameter estimates, standard deviation and their 95% highest posterior density (HPD) credible intervals are placed in the supplementary material (see Additional file 1) and can easily accessed by interested reader. We performed rigorous model selection on the basis of different criteria described earlier. On the basis of log of Pseudo Marginal Likelihood, Bayes factor and BIC, clearly Model 2, based upon Double Exponential distribution is outperforming other models. For some situations Model 1 was doing better. Notably for MSE and cross validation based approach Model 1 was doing better and Model 2 was close second. Model 3 based on normal distribution is not performing well. To save space all model selection results are placed in the supplementary material (see Additional file 1). Though the results are not reported here, we have also tried a mixture of two normal distribution, but this choice produces no improvement. We acknowledge that the decision of choosing the best model is subjective, especially considering so many different model selection criteria all together and in fact there may not be any unique and unanimous model choice in the real world. However, combining all the criterion we suggest the Double Exponential based model as being best model choice for subject diagnostic and biomarker detection purpose. Notably Model 1 is also a strong contender and other mixture model based approach may be taken for modeling $\tilde{H}$_{0}(t). However we would like to investigate them else where in future.
0.7 Subject Diagnostic and Mechanism for Biomarker Identification
Conclusion
In this paper we have demonstrated a novel semiparametric modeling technique for analyzing metabonomic data of acute trauma subjects using mass spectrometry. We have proposed modeling the intensity function, using a mixture of incomplete Beta functions. Also rather than traditional point based modeling we have modeled the whole curve or subject profile through this approach. Different model comparison criterion were used for selecting the best model, both from a classical and Bayesian point of view. We have used a mixture of five Betas for our modeling purpose. A more flexible approach will be to use variable number of mixture components, which will necessitate use of the reversible jump MCMC algorithm.
We would also like to emphasize the advantages of modeling a profile. First, spatial correlations that may exist among chemically related m/z values will be automatically incorporated into the model. Secondly, due to inherent instrument variability, specific or static m/z values may loose their true meaning when compared across different subjects. This again questions the traditional point by point comparison of m/z values across different samples. In fact researchers already perceived this problem and for that reason most [1, 7, 16] mass spectrometry based modeling begins with an alignment or clustering algorithm. Our method is general enough to be applicable with or without alignment. In the absence of any alignment, credible intervals constructed will be wider. This is reasonable considering principles of signal to noise strength. However utmost care should be taken before making any such alignment as wrongly aligned m/z values will result in large false positives, irrespective of the efficacy of latter used methods.
Declarations
Acknowledgements
We would like to thank Grant V. Bochicchio and R. Adams Cowley of University of Maryland for their help in data collection. Funding for this research was partially provided by the National Institute of Health (R01 ES011630), a UCONN Faculty Large Grant and the Office of Naval Research (N000140010792, N000149910905 and N000149910606).
Authors’ Affiliations
References
 Tibshirani R, Hastie T, Balasubramanian N, Scott S, Gongyi S, Albert K, QuynhThu L: Sample classification from protein mass spectrometry, by peak probability contrasts. Bioinformatics 2004, 20: 3034–3044. 10.1093/bioinformatics/bth357View ArticlePubMedGoogle Scholar
 Timothy WR, Yutaka Y: Multiscale Processing of Mass Spectrometry Data. Tech. Rep. Working Paper 230, UW Biostatistics Working Paper Series, Fred Hutchinson Cancer Research Center 2004.Google Scholar
 Kwon DW, Tadesse MG, Sha N, Pfeiffer RM, Vannucci M: Identifying Biomarkers from Mass Spectrometry Data with Ordinal Outcome. Cancer Informatics 2006, 3: 19–28.Google Scholar
 Randolph TW, Tasui Y: Multiscale processing of mass spectrometry data. Biometrics 2006, 62: 589–597. 10.1111/j.15410420.2005.00504.xView ArticlePubMedGoogle Scholar
 J M, Combes RK, Baggerly K, Kobayasi R: Feature extraction and quantification for mass spectrometry data in biomedical application using the mean spectrum. Bioinformatics 2005, 21: 1764–1775. 10.1093/bioinformatics/bti254View ArticleGoogle Scholar
 Coombes RK, Morris SM, Hu J, Edmonson RS, Baggerly AK: Serum proteomics profiling – a young technology begins to mature. Nature Biotechnology 2005, 23: 291–292. 10.1038/nbt0305291View ArticlePubMedGoogle Scholar
 Yutaka Y, McLerran D, BaoLing A, Marcy W, Thornquist M, Ziding F: An Automated Peak Identification/Calibration Procedure for HighDimensional Protein Measures From Mass Spec trometers. Journal of Biomedicine and Biotechnology 2003, 4: 242–248.Google Scholar
 Kazmi AS, Ghosh S, Shin DG, Hill DW, Grant FD: Alignment of high resolution mass spectra: Development of a heuristic approach for metabolomics. Metabolomics 2006, 2: 75–83. 10.1007/s1130600600217View ArticleGoogle Scholar
 Diaconis P, Ylvisaker D: Quantifying prior opinions. In Bayesian Statistics 2. Edited by: Bernardo JM, Berger JO, Smith AFM. Amsterdam: NorthHolland; 1985:133–156.Google Scholar
 Gelfand A, Mallick BK: Bayesian analysis of proportional hazards model built from monotone functions. Biometrics 1995, 51: 843–852. 10.2307/2532986View ArticlePubMedGoogle Scholar
 Geman S, Geman D: Stochastic relaxation, Gibbs distribution and the Bayesian restoration of images. IEEE Trans. on Pattern Anal. and Mach. Intel 1987, 6: 721–741.Google Scholar
 Gelfand AE, Smith AFM: Samplingbased approaches to calculating marginal densities. Journal of the American Statistical Association 1990, 85: 398–409. 10.2307/2289776View ArticleGoogle Scholar
 Anderson RN, Smith BL: Deaths: leading causes for 2001. Natl Vital Stat Rep 2003, 52(9):1–85.Google Scholar
 Greenblatt DJ, Ransil BJ, Harmatz JS, Smith TW, Duhme DW, KochWeser J: Variability of 24hour urinary creatinine excretion by normal subjects. Journal of Clin Pharmacol 1976, 16(7):321–328.View ArticleGoogle Scholar
 Geweke J: Evaluating the accuracy of samplingbased approaches to calculating posterior moments. In Bayesian Statistics 4. Edited by: Bernardo JM, Berger JO, Dawid AP, Smith AFM. Oxford University Press; 1992.Google Scholar
 Ghosh S, Dey D, Hill D, Grant FD: Statistical Approach to Metabonomic Analysis of Rat Urine Following Surgical Trauma. Journal of Chemometrics 2006, 20: 87–98. 10.1002/cem.972View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.