For 1 D MS spectrum, we first perform spectrum preprocessing to remove the baseline, filter the noise and generate a list of peptide candidates. Then BPDA is applied based on the developed MS model to infer the best fitting peptide signals of the observed spectrum, the results being peptide abundances, existence probabilities and so on. For 2 D LC-MS spectra, we first detect peptide elution peaks along the retention time dimension, and build elution peak groups by collecting the peaks which have similar retention time together using a method similar to [24]. Each group contains a series of consecutive spectra, which are then averaged to form a mean spectrum. The rationale of using a mean spectrum to represent the group is that the noise of consecutive spectra could be canceled out to a certain degree [11]. The BPDA algorithm is then applied to each of the mean spectra, and finally an overall peptide list is generated. The details of the preprocessing step, the proposed MS model, and the BPDA algorithm are described in the following subsections.
Spectrum preprocessing and obtaining peptide candidates
A non-flat baseline is often observed in mass spectra, the presence of which can distort the true signal pattern. Thus the first preprocessing step is to detect and subtract the baseline from MS spectra. We use the minimum of a sliding window along the m/z axis as the baseline, similar to the method used in [10]. The next step is peak detection. We use the Matlab function "mspeaks" [26] to perform this task. The algorithm first identifies all local maxima in the wavelet denoised spectrum as putative peak locations. Then peaks are filtered based on their intensities and signal to noise ratios. The last step of preprocessing is to obtain a list of peptide candidates. Considering one detected peak with centroid at m/z value d, we want to find out which peptides can potentially register a peak at this position. The answer is given below in terms of the masses of such peptides:
(1)
where mass is the mass of one peptide candidate, m
pc
is the mass of one positive charge and m
nt
is the mass shift caused by addition of one neutron. Due to mass defect, the mass shift varies for different elements. We approximate m
nt
using the mass shift from 13C to 12C, which is 1.0034, since Carbon contributes most to the isotope patterns. This approximation works well if the mass calibration of the instrument is correct. The parameters cs and iso are user defined maximum numbers of considered charge states and isotopic positions, respectively. It is easy to see from the above equation that each detected peak gives rise to cs × (iso + 1) different peptide candidates (masses). These candidates exhaust all the possibilities to generate the peak with centroid d, but it does not follow that all the candidates really exist in the sample. Therefore, our primary goal in peptide detection is to find the existence probability of each peptide candidate. Also note that the total number of candidates should be less than or equal to cs × (iso + 1) × number of detected peaks, as is possible that multiple peaks yield the same candidate mass.
Modeling the mass spectrum
Suppose N peptide candidates are obtained from the observed spectrum using the method described in the previous section. Each candidate can generate a series of peaks over different charge states, and at each charge state several isotopic peaks can be registered. The signal generated by the k th peptide candidate is thus modeled by the following equation, in which i and j represent the charge state and the isotopic position of the candidate peptide, respectively:
(2)
where the peak shape function is given by . That is, the peak is modeled as Gaussian-shaped, as in [27]. It is reported that the Gaussian-shaped peak approximates the reality well enough to obtain good detection results [17]. Still, this peak shape function can be adjusted for different instruments without affecting the overall structure of the algorithm.
The observed spectrum is a mixture of the signal generated by the N peptide candidates plus Gaussian random noise, which can be modeled as:
(3)
In the above three equations, x
m
is the m th mass-to-charge ratio (m/z) in the spectrum, y
m
is the observed intensity at x
m
, M is the number of observations, and ϵ
m
is Gaussian random noise with zero mean and standard deviation σ. The value of can be approximated by the standard deviation of the background region in the spectrum. Note that we model ϵ
m
as additive Gaussian which is generally a good model for the thermal noise in electronic instruments. There are reports of non-Gaussian noise in FTMS [28] and thus it is safer to apply the proposed algorithm to TOF MS instruments [29]. The parameters of the k th candidate, namely, α
k,ij
, ρ
k,ij
, λ
k
and c
k,ij
are discussed in detail below:
where mass
k
is the mass of the k th candidate. Since the candidate's mass is already obtained, α
k
, ijcan be calculated.
-
ρ
k,ij
relates to the shape (width) of the peak centered at α
k,ij
. It can be estimated by using its relationship to the peak's Full Width at Half Maximum (FWHM): .
-
λ
k
is an indicator random variable, which is 1 if the k th peptide candidate truly exists in the sample and 0 otherwise.
-
c
k,ij
is the height (i.e. intensity) of the peak generated by peptide k, at charge state i and isotopic number j.
In summary, the model considers peaks at different isotopic positions and charge states simultaneously for each peptide candidate, incorporating candidates' existence probabilities and the spectrum thermal noise.
Bayesian peptide detection
Let
be the set of all the unknown model parameters. The goal of our algorithm is to determine the value of θ based on the observed spectrum y = [y1,..., y
M
] T . In fact, the value of λ
k
is of our prime interest for the peptide detection problem. For this purpose, we can use a Bayesian approach to first obtain the a posteriori probability (APP) of all the parameters, P (θ| y). Then the APPs P (λ
k
|y), k = 1, ..., N, can be obtained by integration of the joint posterior distribution P (θ| y) over all parameters except λk. Clearly, the calculation involves high dimension integration which is not an easy task. Besides, due to the highly nonlinear nature of the data model, none of the desired APPs can be obtained analytically. To overcome the computational obstacle, we resort to the Gibbs sampling method [30], which is a variant of the Markov Chain Monte Carlo (MCMC) approach [31], to sample the model parameters.
Gibbs sampling is an iterative scheme, which uses the popular strategy of divide-and-conquer to sample a subset of parameters at a time while fixing the rest at the sample values from the previous iteration, as if they were true. In other words, for the l th parameter group θ
l
, we sample from the conditional posterior distribution P( θ
l
|θ-l, y), where θ
-l
≜ θ\θ
l
. After this sampling process iterates among the parameter groups for a sufficient number of cycles (which is referred to as the "burn-in" period), convergence is reached. The samples collected afterwards are shown to be from the marginal posterior distribution P( θ
l
|y), which is independent of θ
-l
, and thus these samples can be used to estimate the target parameters.
The Gibbs sampling process for the k th peptide candidate and the derivations of the conditional posterior distributions of important model parameters are briefly summarized below. The detailed derivations can be found in Additional file 1.
-
Sample the peak height vector ck ≜ [c
k,ij;
i = 1,..., cs, j = 0,..., iso]Tfor the k th peptide candidate
The heights of all the possible peaks (over different charge states and isotopic positions) of the k th peptide candidate are included in the peak height vector c
k
and are sampled simultaneously from the conditional posterior distribution of c
k
, which, by the Bayesian principle, is proportional to the likelihood times the prior:
(5)
where .
The derivations of the likelihood, the prior distribution and the conditional posterior distribution of c
k
are given in Additional file 1.
• Sample the peptide existence indicator variable
λ
k
The conditional posterior distribution of λ
k
is given by
where .
The derivations of the likelihood, the prior distribution and the conditional posterior distribution of λ
k
are given in Additional file 1
The complexity of the proposed Gibbs sampling algorithm is determined by two factors: (1) the sheer number of peptide candidates, and (2) the correlation between parameters that need to be sampled. The algorithm complexity grows exponentially with the number of peptide candidates, and the correlation between parameters reduces the sampling efficiency. To address these two issues, we first partition non-overlapping peptide candidates into different groups. The proposed algorithm can be applied to each group in a parallel manner and the algorithm complexity is reduced, because within each group the number of candidates is reduced, and the corresponding signal-containing spectrum region is restricted. Peptide candidates within each group are then clustered by the k-means clustering algorithm [32], the distance measure being the correlation between peptide candidate signals. Peptide candidates within a cluster have strong correlations among each other, and their indicator variables are sampled from the joint conditional posterior distribution. These two measures improve the overall efficiency of the algorithm. The pseudocode of the entire Gibbs sampling process is given in Additional file 2: Table S1.
The samples taken after convergence can be used to estimate the target parameters. Particularly, the existence probability of peptide k is calculated as
(6)
where r0 is the first iteration after convergence is reached, R is the total number of iterations, and is the sample value of λ
k
in the r th iteration. The k th peptide candidate is said to be detected if its existence probability P(λ
k
= 1|y) is greater than a predefined threshold.