A semiparametric modeling framework for potential biomarker discovery and the development of metabonomic profiles

Background The discovery of biomarkers is an important step towards the development of criteria for early diagnosis of disease status. Recently electrospray ionization (ESI) and matrix assisted laser desorption (MALDI) time-of-flight (TOF) mass spectrometry have been used to identify biomarkers both in proteomics and metabonomics studies. Data sets generated from such studies are generally very large in size and thus require the use of sophisticated statistical techniques to glean useful information. Most recent attempts to process these types of data model each compound's intensity either discretely by positional (mass to charge ratio) clustering or through each compounds' own intensity distribution. Traditionally data processing steps such as noise removal, background elimination and m/z alignment, are generally carried out separately resulting in unsatisfactory propagation of signals in the final model. Results In the present study a novel semi-parametric approach has been developed to distinguish urinary metabolic profiles in a group of traumatic patients from those of a control group consisting of normal individuals. Data sets obtained from the replicates of a single subject were used to develop a functional profile through Dirichlet mixture of beta distribution. This functional profile is flexible enough to accommodate variability of the instrument and the inherent variability of each individual, thus simultaneously addressing different sources of systematic error. To address instrument variability, all data sets were analyzed in replicate, an important issue ignored by most studies in the past. Different model comparisons were performed to select the best model for each subject. The m/z values in the window of the irregular pattern are then further recommended for possible biomarker discovery. Conclusion To the best of our knowledge this is the very first attempt to model the physical process behind the time-of flight mass spectrometry. Most of the state of the art techniques does not take these physical principles in consideration while modeling such data. The proposed modeling process will apply as long as the basic physical principle presented in this paper is valid. Notably we have confined our present work mostly within the modeling aspect. Nevertheless clinical validation of our recommended list of potential biomarkers will be required. Hence, we have termed our modeling approach as a "framework" for further work.


CPO Based Criterion
The CPO statistics is a very useful tool in statistical literature, first defined by Geisser (Geisser, 1980) as a diagnostic measure to detect observations discrepant from a given model. For detailed discussion on this please see (Gelfand, Dey and Chang, 1992) and (Sinha and Dey, 1997). In our situation where we have r replicates from a single subject, the whole data set for a subject is  (1) is not possible to obtain; however a Monte Carlo approximation given by (Gelfand, Dey and Chang, 1992) for the i k CPO is, where ) g ( Θ denotes the parameter samples at the g-th, g = 1, 2,…,G, iteration of the Gibbs sampler. A large CPO is indicative of the better fit at that specific point. Pseudo marginal is a useful summary statistic based on the sum of the logarithm of the CPO of the individual observation is defined as Models with greater * B values will indicate better for the specific subject profile.

Bayes Factor Based Criterion
The MCMC based approach for calculating Bayes factors was developed by (Chib, 1995), and discussed earlier by (Kass and Raftery, 1995). Given the k-th model, (Chib, 1995), has defined the log of Basic Marginal Likelihood Identity (BMI) as and the estimated Bayes factor for two rival models indexed by k and k ′ as, (2) We would like to take importance weighted marginal density estimation (IWMDE) based approach suggested by (Chen, 1994) to estimate ) (t m k directly. Let ) g ( Θ denote the parameter samples at the g-th, g = 1, 2,…,G, iteration of the Gibbs sampler from posterior distribution. Then IWMDE yields a consistent estimator for the ) ( t m k as: . A good w should be chosen such that it should have similar shape to the conditional marginal density, which is often unknown. However as pointed out in (Chen, 1994)  importance sampling density for w . We have chosen posterior mean and negative Hessian matrix to evaluate μ and Σ . Once the k k B ′ has been estimated, if it turns out to be greater than one it suggests that data supports k-th model over the k ′ -th model and vice versa.

Bayesian Information Criterion
Bayes information criterion is a model comparison criterion based on marginal likelihoods. There are various approximation of this criterion exists in the literature. However we would like to use the approximation achieved by (Schwartz, 1978), which for a model with p parameters and n sample observation is given by The first term of BIC is just involving the likelihood and the second term is a penalty for model complexity. Model with lower BIC score is preferable. Once calculated we can take the difference between BIC's of the two models to get approximate Bayes factor. We would like to refer (Dudely and Haughton, 1997) for further discussion on BIC.

Mean Square Error Based Approach
Unlike survival analysis domain, in metabonomics domain, primary interest is in i t δ , which in turn depends on the intensity function h(t). This is an estimable quantity for all subjects. The whole profile analysis in the current approach works through modeling h(t). For a specific subject the mean square error (MSE) for m-th model is given by A model with minimum MSE is chosen to be the best model in mean square sense. (4) A common cross validation criterion is to leave-one-out estimator obtained by evaluating the model parameters with k-th observation removed. Let the k-th m/z value be i k t and its corresponding relative intensity i k d is being removed from the i-th replicate of a subject to obtain i k D − as in the subsection 1.1. The available information is denoted as

A Cross Validation Measure
as the conditional predictive intensity function (CPIF). However estimation of this quantity poses additional challenge. It is computationally infeasible even for a single subject, for thousands of mass spectrometry data point. However through Monte Carlo sample obtained earlier, combined with numerical integration it is possible to estimate It is easy to note that for any m/z value or t following holds is straight forward as discussed in equation (2). To

Additional SRIF and PRA plots
We provided SRIF plot for patient P-DG51 in the main paper. Figure 1-3 represents SRIF for plots of three additional patients. The main point to note that for patients SRIF curve is considerably outside the HPD credible interval obtained from the control subjects. Interestingly for patient P-DG 31, replicate 1, (Figure 2) the SRIF plot is almost within the credible interval.
However for other replicates a large deviation is observed so that we can comfortably assign patient status. This actually strengthens the necessity of replication in MS experiment.      As discussed in the main paper SRIF and PRA plot has two distinct purposes namely, 1> overall subject diagnostic (more global in m/z scale) 2> potential biomarker identification (more localized). Hence in SRIF a broad picture of subjects well being is captured through the plot at much higher resolution. For PRA, we concentrate more on the finer scale to determine the list of m/z values falling outside HPD credible interval obtained from the controls. We hypothesized that for subjects with some disorder (trauma here) we will see much outlying m/z value, which is presented at PRA plot for three patients in figure 4-6. This will not be the case for a normal individual (please see figure 5 of the original paper). It is important to note that prior to SRIF and PRA based analysis, all the subject specific medications (for both control and patient) should be eliminated otherwise this may lead us to false discovery of medications as biomarker.

Additional Model Comparison Results
Tables 1-3 show different model parameter estimates, standard deviations and their HPD credible intervals. Notably we are not plugging the posterior estimates of different parameters directly during estimation stage. Details of pooling control subjects are being explained latter on. Tables 4-8 represents model comparison results for three models under different model selection criteria. Within the domain of these model choices we exhaustively searched for the best model. An additional flexibility analysis is also reported latter to show that by increasing the number of mixture parameters, we will not gain much in terms capturing local features. Table 1  Table 2  Table 3  Table 4   Table 5  Table 6   Table 7  Table 8 4

. Pooling of Control Subjects to Construct HPD
We are estimating seven parameters namely five mixture parameters ( ) η , location ( ) α and scale ( ) β parameter in ) ( t H for each subject. These help us estimating relative intensity (hazard) function given by Likelihood for a subject is given in equation (3) of the main paper. Note that for each of the control subjects we are obtaining samples from the posterior distributions of each parameter. Under the hypothesis that control subjects are homogeneous, the posterior parameter distribution should be similar. To illustrate more consider location parameter α only. We obtain posterior samples from all control subjects separately and then mix them together under the hypothesis that posterior distribution of α is same for all controls. Notably this will not be true if underlying hypothesis of homogeneity among controls is violated. Since different trauma types are highly heterogeneous, it is a prime reason that we cannot combine patient samples. Then we repeat the same procedure for other parameters too. Figure 7 represents SRIF plots for same subject (patient) for two different replicates. Evidently credible intervals are different and apparently may not be clear to the reader why it should be so even for the same patient across replicates. Notably credible intervals are constructed exclusively by combining controls, and of course once posterior samples are obtained they never change. However note that ) (t h is also a function of t (proxy of the m/z value) with seven other parameters and the location of the observed intensities (t) are different even between replicate to replicate. This is depicted in figure 8.

Figure 7
We calculate credible interval for intensity which corresponds to a patient based on a replicate at a position where we observe the intensity value. Since 1 t for replicate 1 is different than 1 t in replicate 2 this implies associated ) (t h value and its credible interval will be also different. In fact 1 t denotes the first m/z value (location) of a replicate and 1 t across different replicates has no straight forward relationship except for a place holder. Notably we have not used any alignment algorithm in the current study. However they (t values) will be exactly same if the replicates are in perfect alignment. The PRA estimate is given by As t δ is very much depends upon ) (t h , which in turns depends on observed t of a specific replicate (say i -th), it is necessary to introduce a superscript-i. in t δ . Since we have many  It is easy to construct HPD credible interval from G-many MC estimates of i t δ following Chen-Shao's (JCGS, 1998) algorithm. Notably if there is no variation in the x-axis (observed m/z value) we do not need i-th subscript in t δ consequently credible intervals will not vary across replicates.
Similar conclusion can be drawn from SRIF which is expressed as Again SRIF depends upon estimation of ) (t h and all the above reasoning will also be valid for difference in credible interval across replicates.

Some Comments About Flexibility with Five beta Mixtures
A pertinent question is whether the considered model can capture complexity in the data with only five beta mixtures. We performed several flexibility analyses by varying number of mixands at different levels. In our experiment we have tried increasing the number of components from five to ten. While more mixture components will somewhat improve the result however it comes at the expense of increasing the computational cost. We would like to report that for the subject specific global profile estimation (SRIF) five mixands is as good as ten for all three models considered here. However this is highly study specific and we recommended trying several of them in consultation with domain experts. In principle parsimonious model selection should be encouraged. Another way out is to treat the number of mixing components as an unknown quantity which needs to be estimated from the data. However this will come at the increasing cost of computational demand.