ProbPS: A new model for peak selection based on quantifying the dependence of the existence of derivative peaks on primary ion intensity

Background The analysis of mass spectra suggests that the existence of derivative peaks is strongly dependent on the intensity of the primary peaks. Peak selection from tandem mass spectrum is used to filter out noise and contaminant peaks. It is widely accepted that a valid primary peak tends to have high intensity and is accompanied by derivative peaks, including isotopic peaks, neutral loss peaks, and complementary peaks. Existing models for peak selection ignore the dependence between the existence of the derivative peaks and the intensity of the primary peaks. Simple models for peak selection assume that these two attributes are independent; however, this assumption is contrary to real data and prone to error. Results In this paper, we present a statistical model to quantitatively measure the dependence of the derivative peak's existence on the primary peak's intensity. Here, we propose a statistical model, named ProbPS, to capture the dependence in a quantitative manner and describe a statistical model for peak selection. Our results show that the quantitative understanding can successfully guide the peak selection process. By comparing ProbPS with AuDeNS we demonstrate the advantages of our method in both filtering out noise peaks and in improving de novo identification. In addition, we present a tag identification approach based on our peak selection method. Our results, using a test data set, suggest that our tag identification method (876 correct tags in 1000 spectra) outperforms PepNovoTag (790 correct tags in 1000 spectra). Conclusions We have shown that ProbPS improves the accuracy of peak selection which further enhances the performance of de novo sequencing and tag identification. Thus, our model saves valuable computation time and improving the accuracy of the results.

Typical database searches first identify a set of candidate peptides from a protein sequence database, and then construct a theoretical spectrum for each peptide. Finally, the similarity between the theoretical spectrum and the MS/MS experimental spectrum is calculated and the most similar peptides are reported as predictions. There are several popular tandem mass spectrometry data analysis programs of this type: SEQUEST [3], Mascot [4], X!Tandem [5], SCOPE [6], and ProbID [7], are some examples of these. Before comparing a theoretical spectrum against an experimental spectrum, noise peaks in the experimental spectrum should be filtered out. Noise peaks in the spectrum can cause significant differences between the experimental and theoretical spectra and, as a result, correct solutions may be missed.
De novo sequencing, on the other hand, is databaseindependent because it exclusively uses the information contained in the MS/MS spectrum. Thus, the de novo technique has the potential to identify peptides that are not included in protein sequence databases. Widelyused de novo packages include PEAKS [9,10], PepNovo [11,12], et al. [13][14][15][16][17][18][19][20][21][22][23][24][25][26][27][28][29][30][31] Recently, variants of de novo sequencing, the tag-based methods [32][33][34][35][36][37][38], have been developed to identify a segment of a peptide rather than a full-length peptide. After inferring the tags from a MS/MS spectrum, the candidate peptides that do not match any of the tags are filtered out. Therefore, an effective tag identification method can improve identification accuracy and reduce the running time for database searches by reducing the number of candidate peptides. Both de novo methods and tag-based methods usually require high-quality spectra, and do not perform well on spectra with noise peaks. Thus, peak selection is important for the effective use of de novo methods.
Generally speaking, there are three types of peaks in a tandem mass spectrum: i) the primary peak that is highly likely to be accompanied by a set of derivative peaks caused by the loss of ammonia, the loss of water, or isotopic shift; ii) noise peaks from signals from mass spectrometry and other unknown reasons; and iii) peaks generated from contaminants. Although isotopic shifts and neutral losses are often observed for peaks generated from contaminants, complementary peaks are seldom observed. This provides a way to distinguish valid peaks from noise and contaminant peaks. In this study, the latter two peaks are called noise peaks.
Before attempting to identify a peptide from a MS/MS spectrum, it is useful to perform a pre-processing step (called peak selection) to filter out noise and contaminant peaks. A widely accepted peak selection rule utilizes two peak attributes, peak intensity and the existence of derivative peaks. Briefly, a peak accompanied by derivative peaks and an associated complementary peak is likely to be valid; peaks without these features are likely to be noise. Our observations suggest that the existence of derivative peaks and complementary peaks is strongly depending on the primary peak intensity. Existing methods for peak selection adopt simple models that assume that these two attributes are independent. This assumption contradicts to real data and is error prone. In this study we proposed a statistical model, named ProbPS, to capture the interdependence of peak intensity and the existence of derivative peaks in a quantitative manner. Our experimental results demonstrate that our model can improve both peak selection and tag identification.

Notation
For a peak p in a tandem mass spectrum, • V = 1 if the peak is a valid primary peak; otherwise V = 0.
• I is the peak intensity; • ISO indicates the existence of isotopic shift; • NH 3 indicates the existence of a peak that corresponds to the neutral loss of an ammonia; • H 2 O indicates the existence of a peak that corresponds to the neutral loss of a water; • COMP indicates the existence of a peak that corresponds to a complementary ion; 2.2 The model for peak selection 2.2.1 Quantifying the dependency of derivative ions on primary peak intensity To investigate the dependency of derivative ions on primary peak intensity we used spectra from the Swed-CAD database [39], a collection of high quality MS/MS spectra of tryptic peptides. Using SEQUEST, we identified 15,897 unique, annotated peptide-spectrum matches (PSM) to use as a training set.
We first count the number of valid primary peaks with an intensity I (N total (I)) in the training set. From the valid primary peaks, the peaks having isotopic shift were identified and counted (N ISO (I)). The probability that a valid primary peak has an isotopic shift can then be estimated as P (ISO|I, V = 1) = N ISO (I) N total (I) . Similarly, P (ISO|I, V = 0), P(COMP|I, V = 1) and P(COMP|I, V = 0) were estimated and the results are shown in Figure 1, 2, 3, 4.
In Figure 1 an evident nonlinear relationship between primary peak intensity and the existence of isotopic peaks can be observed. The nonlinear relationship can be explained by supposing that, for a primary ion, its isotopic derivative is observed with probability p. Then, for a total of I primary ions, an isotopic derivative would be observed with probability Therefore, it is reasonable to approximate this relationship using an exponential function. Like P(ISO|I, V = 1), P(ISO|I, V = 0) also approximates 1 as the peak intensity goes to infinity. The reason for the slight differences in Figure 1 and 2 is that a contaminant ion might generate an isotopic shift similar to the shift generated by a primary ion. A significantly different pattern between P (COMP|I, V = 1) and P(COMP|I, V = 0) is observed ( Figure 3 and 4) because for contaminant ions, complementary peaks are seldom generated. The relationship between derivative peaks related to neutral losses and primary peak intensity were also calculated and are shown in Figure 5, 6, 7, 8, 9, 10, where b-ion and y-ion are listed separately because they differ in the possibility of neutral losses. The results in the figures indicate that P(ISO|I, V = 1) approximate 1 as the primary peak intensity goes to infinity. On the other hand, P(NH 3 |I, B) and P (NH 3 |I, Y ) approximate a number smaller than 1 and so do P(H 2 O|I, B) and P (H 2 O|I, Y ). The reason for this is that neutral losses are related to the composition of the amino acid ions. Some amino acids can lose ammonia or water, while others cannot [8]. In our study, we have introduced a scale factor to capture the influence of the amino acid composition on neutral losses. Figure 5, 6, 8, and 9 support the earlier observation that b-ions are more likely to have neutral loss than y-ions [40,41]. In summary, noise peaks usually show different patterns from valid peaks, and this observation presents an opportunity for valid peak selection.

Bayesian framework for peak selection
A quantitative description of the dependences was employed to develop a Bayesian framework for peak selection. Let P(V = 1|I, D) denote the probability that a peak is valid given two types of peak attributes, peak intensity I, and the existence of derivative peaks D = {ISO, NH 3 , H 2 O, COMP }. Then, P (V = 1|I, D) was estimated as follows: where p (V = 1) = P(I, D|V = 1)P(V = 1) and p (V = 0) = P (I, D|V = 0)P(V = 0).
Since derivative peaks are variants of primary ions, it is reasonable to assume the independence of different derivative peaks. Therefore, the numerator of the above fraction can be decomposed as: Similarly, the denominator can be rewritten as: Finally, the following approximations were obtained: where r v = P (V = 0|I)/P (V = 1|I).
The relationship between r v (I) and I was calculated using the data set obtained from SwedCAD and the results are shown in Figure 11. A clear geometric distribution was obtained. Similar results were also obtained using data sets from Keller's lab [42] (See Additional File 1: figure S1).

Peak selection based on probPS
We use P(V = 1|I, D) (denoted as probPS) to determine whether or not a peak was valid. For each peak in the training spectra, probPS was calculated, and the distribution of probPS is summarized in Figure 12. It can be observed that a valid primary peak usually has a high probPS value, while a noise peak usually has a low probPS value. Further, peaks with probPS ≥ 0.5 are highly likely to be valid. Therefore, we can utilize the posterior probability to distinguish valid peaks from noise ones. For instance, peaks with higher probPS can be selected to execute the denovo algorithm or for tag identification. The probPS score can also be used to improve database searches by filtering out invalid peaks.
We also compared probPS against the relevance value used in AuDeNS [43]. AuDeNS uses a framework for de novo sequencing of peptides. It first cleans the input spectrum with a number of data cleaning algorithms ("grass mowers"), followed by a sequencing algorithm. It applies the mowers to the input data, assigning to each input peak i a relevance value r(i), with the default being r(i) = 1. Hereby, each mower M uses a relevance factor Rel M (which can be set as a parameter of AuDeNS), and the relevance For the sake of fair comparison, we used the same data sets as AuDeNS, i.e., a training set with 266 LCQ spectra, and a test data set with 20 LCQ spectra. The results of the comparison (shown in Figure 13) suggest that probPS outperforms relevance. Specifically, when the false positive rate is set to 0.2, probPS has a higher true positive rate (0.9) than relevance (0.79).
We also compared the performance of probPS and AuDeNS using two categories of primary peaks, high peaks (peak intensity I >10000) and low peaks (peak intensity I <10000). The results of the comparison are summarized in Figure 14 and 15. It has often been assumed that high intensity peaks are more likely to be valid. However, this is not always true because valid low intensity peaks also exist. For example, ions with small mass/charge ratios, say b + 2 , y + 1 and y + 2 , are generally of low intensity and can even be invisible in ion trap fragmentation spectra. The results in Figure 6 illustrate that probPS is much better than the relevance of AuDeNS for selecting low intensity peaks.

Improving de novo identification using probPS
We investigated whether or not peak selection can improve de novo performance. We ran the de novo algorithm [31] in AuDeNS with two types of spectra, one that was generated through peak selection based on probPS and the other that was generated through peak selection based on relevance. The de novo algorithm in AuDeNS will generate top 30 candidate peptides, and the ranks of correct matched peptides obtained using the two peak selection methods are listed in Table 1. For some spectra, say 03C.1361.1361.2, peak selection using probPS is better than that using AuDeNS because  probPSgave the real peptide a higher rank. For the 01C.0492.0496.2 spectrum, the de novo algorithm failed to identify the correct peptide when peak selection based on relevance was used; in contrast, when probPS was used, the real peptide was identified. Using probPS for peak selection the de novo algorithm correctly identified the top 3 candidate peptides, and overall probPS (reporting 8 correct peptides) outperformed relevance (reporting 6 correct peptides).
We performed cross-validation over the 266 LCQ spectra. The 266 spectra were arbitrarily divided into four groups and in each validation round, three groups were used as the training set, and the remaining group was used as the test set. Because some of the spectra were from the same peptide, which might lead to overfitting, a pre-processing step was performed to ensure that spectra from the same peptide were in the same group. The performance of probPS and AuDeNS in the four validation rounds are listed in Table 2. The results clearly show the advantage of using probPS over AuDeNS for peak selection.

Identifying tags based on probPS
Ordinary tagging methods directly identify tags on a given mass spectrum. For example, PepNovoTag [36] extracts all substrings of the desired length from the PepNovo reconstruction process, and uses a logistic regression model to evaluate these tags. This strategy suffers from noise peaks in the spectrum. Our method only uses the peaks with high probPS values to generate tags. Specifically, our tag identification method (called probTag) starts with the top peaks with high probPS along with their complementary peaks to find the most reliable neighbor peaks. We selected the first 1000 spectra reported by Swed-CAD as the test data set (spectrum IDs from 1.683.39666.2.dta to 1000.1312.70275.2.dta), and used the remaining spectra in SwedCAD as the training data set. Table 3 summarizes the tag identification performance of probTag and PepNovoTag. When the desired tag length was set to 3, probTag found 876 of the 1000 tags correctly while PepNovoTag found 790 tags. When the desired tag length was set to 4, probTag found 760   a Tag identification was "correct" if the tag was contained in the real peptide. Accuracy denotes the ratio of "correct" tag identification.
correct tags while PepNovoTag found 709 tags. When the desired tag length was set to 5, the two methods found almost the same number of correct tags; however, probTag had a higher accuracy (74.67% for probTag compared to 61.0% for PepNovoTag). It should be noticed both PepNovoTag and ProbTag are combinations of peak selection and tagging techniques. This is only an implicit and indirect evidence of the peak selection performance.

Conclusion and discussion
In this study, we described the dependence between derivate peaks and primary ion intensity in a quantitative manner. The experimental results demonstrate that this quantitative description can help improve the accuracy of peak selection which further improves the performance of de novo sequencing and tag identification.
In addition to the peak attributes used in the study, other attributes like, for example, consecutive ions may prove to further improve peak selection. In general, valid peaks are more likely to have a consecutive ion than invalid peaks. In future work, we aim to incorporate this attribute into our peak selection method.

Additional material
Additional file 1: supplementary Figure S1. The relationship between r v (I) and I was also calculated using the data set obtained from an ESI data set provided by Keller. The relationship between r v (I) and I shows similar geometric distribution shape, though parameters are not the same due to different experiment conditions. Caption of the Figure: Distribution of r v (I) calculated using the data set from the Keller's Lab.