Quality assessment of tandem mass spectra using support vector machine (SVM)

Background: Tandem mass spectrometry has become particularly useful for the rapid identification and characterization of protein components of complex biological mixtures. Powerful database search methods, such as SEQUEST and MASCOT, have been developed for peptide identification; they compare the mass spectra obtained from unknown proteins or peptides with theoretically predicted spectra derived from protein databases. However, the majority of spectra generated in a mass spectrometry experiment are of too poor quality to be interpreted, while some high quality spectra cannot be interpreted by one method but perhaps can be by others. Hence a filtering algorithm that removes poor quality spectra prior to the database search is appealing. Results: This paper proposes a support vector machine (SVM) based approach to assess the quality of tandem mass spectra. Each mass spectrum is mapped into the 16 proposed features that describe its quality. Based on the results from SEQUEST, four SVM classifiers with the 16 features as input are trained and tested on the ISB data and the TOV data, respectively. The superior performance of the proposed SVM classifiers is illustrated both by comparison with existing classifiers and by validation in terms of MASCOT search results. Conclusion: The proposed method can be employed to effectively remove poor quality spectra before the spectral search, and also to find more peptides or post-translationally modified peptides from high quality spectra using different search engines or de novo methods.


Background
With the development of proteomics, tandem mass spectrometry (MS/MS) has been used for the rapid identification and characterization of protein components of complex biological mixtures. Several database search programs such as SEQUEST [1] and MASCOT [2] have been developed to identify peptides by comparing the mass spectra obtained from unknown proteins or peptides with theoretically predicted spectra derived from protein databases. However, it is well known that these search programs produce a significant number of incorrect peptide assignments and leave the majority of spectra uninterpreted. One of the reasons this happens is that the majority of spectra generated from a mass spectrometry experiment are of too poor quality to be interpreted. The process of evaluating peptide assignments often relies on time-consuming and experience-dependent manual verification. Hence a filtering algorithm that removes those spectra with poor quality prior to the database search is appealing.
During the past few years, there have been a number of studies concerning the evaluation of the results of various search programs. Moore et al. described a probabilistic scoring scheme called Qscore to evaluate SEQUEST database search results [3]. Keller et al. applied the expectation maximization algorithm to estimate the accuracy of peptide identifications [4]. Anderson et al. employed the support vector machine (SVM) to distinguish between correctly and incorrectly identified peptides obtained by the SEQUEST search program [5]. Razumovskaya et al. developed a method that combines a neural network and a statistical model to normalize SEQUEST scores and to provide reliability estimates for SEQUEST hits [6]. More recently, Nesvizhskii et al. described a dynamic quality scoring approach for finding high quality unassigned spectra in large shotgun proteomic datasets [7].
The earliest work concerned with the quality assessment of tandem mass spectra prior to database search was reported by Tabb et al. [8]. They assessed the spectral quality using some simple rules such as minimum and maximum thresholds on the number of peaks and a minimum threshold on total peak intensity. They claimed that such rules could remove 40% or more of the poor quality spectra. Purvine et al. used a pre-filtering algorithm named SPEQUAL with three features for tandem mass spectral quality assessment [9]. These three features were charge state differentiation, total signal intensity, and signal-to-noise estimates. They claimed that 55% of the poor quality spectra could be safely eliminated from further analysis by employing the SPEQUAL algorithm. Bern et al. proposed two different classification schemes for automatic spectral quality assessment [10]. One scheme used linear Fisher analysis to construct a classifier based on seven features: Npeaks, Total Intensity, Good-Diff Fraction, Isotopes, Complements, Water Losses, and Intensity Balance. The other employed an SVM classifier based on observed mass/charge (m/z) ratios. The best result reported by Bern et al. [10] is that their SVM based classifier could remove 75% of the poor quality spectra while losing 10% of the high quality ones.
More recently, Flikka et al. [11] presented a filtering algorithm to eliminate the poor quality spectra before the database search. They tested and compared several classifiers on various proteome datasets (Q-TOF, ESI IT, and MALDI-TOF) from different instruments, and the best results from the classification test using the ESI IT dataset showed that 83% of the poor quality spectra could be removed while losing 10% of the high quality ones. Salmi et al. [12] proposed a pre-filtering scheme for evaluating the quality of spectra before the database search, and they obtained a minimum false positive rate (FPR) of 25% while fixing the true positive rate (TPR) at 90%. Na et al. [13] proposed a machine learning approach to assess spectral quality using three spectral features: Xrea, which is based on cumulative intensity normalization, and the Good-Diff Fraction proposed by Bern et al. [10] for singly charged and for doubly charged fragment ions. Na et al. [13] claimed that their method could filter out 75% of poor quality spectra while losing 10% of high quality ones when evaluated on the ISB dataset. In [14], a probability based approach called msmsEval was proposed to assess the quality of tandem mass spectra. Using the ISB dataset as the classification test data, a true negative rate (TNR) of about 83% was obtained at a TPR of 90%.
This paper investigates the quality assessment of tandem mass spectra. The spectra are classified into two groups: high quality and poor quality spectra. In general, a spectrum is said to be of high quality if it can be identified by some method; otherwise it is said to be of poor quality. Several spectral features are proposed for the classification, and the SVM is applied to solve this classification problem. The results of computational experiments on two different mass spectral datasets (ISB and TOV) show that the proposed method can remove the majority of the poor quality spectra while losing a small minority of the high quality ones.

Spectral features
A mass spectrum usually contains tens to hundreds of m/z values on the x-axis, each with a corresponding signal intensity on the y-axis. In this study, after the noisy peaks are removed by the morphological reconstruction method [15], 16 spectral features are introduced for a spectrum as follows. The number of peaks with relative intensity >0.1, square root-transformed. In this study, the relative intensity of each peak is defined as the peak's intensity divided by the intensity of the highest peak. The log or square root transformation of the above spectral features was employed to obtain a more symmetric distribution and to minimize the variance across spectra in a mass spectral dataset. The experiments also verified that such transformations improved the performance of the spectral quality assessment by the proposed SVM method.
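The relative intensity definition and the peak-count feature described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the input is assumed to be a list of already denoised, positive peak intensities, and the example spectrum is made up.

```python
import math

def relative_intensities(intensities):
    """Divide each peak intensity by the intensity of the highest peak."""
    if not intensities:
        return []
    top = max(intensities)  # assumed to be > 0 for a real spectrum
    return [value / top for value in intensities]

def sqrt_peak_count(intensities, threshold=0.1):
    """Square root of the number of peaks with relative intensity > threshold."""
    rel = relative_intensities(intensities)
    return math.sqrt(sum(1 for r in rel if r > threshold))

# Made-up intensities of a denoised spectrum (illustration only).
example = [200.0, 10.0, 50.0, 21.0, 19.0, 100.0]
feature_value = sqrt_peak_count(example)
```

In this made-up spectrum four peaks exceed a relative intensity of 0.1, so the feature value is 2.0.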
To develop the remaining 12 features, four variables are defined for a given peptide mass spectrum $S$ in terms of $m(x)$ and $m(y)$, the m/z values of peaks $x$ and $y$ in the spectrum $S$, and $m(H)$, the mass of a hydrogen atom. A weighting factor is also defined in terms of $I_r(x)$ and $I_r(y)$, the relative intensities of peaks $x$ and $y$ in the spectrum $S$.
$F_5$-$F_7$: Amino acid distances. These features measure how likely it is that two peaks in a spectrum $S$ differ by the mass of one of the twenty amino acids. In their definitions, $M_i$ ($i = 1, 2, \ldots, 17$) are the 17 different masses of the 20 amino acids. This study considers all Methionine amino acids to be sulfoxidized and does not distinguish the masses within three pairs of amino acids: Isoleucine vs. Leucine, Glutamine vs. Lysine, and sulfoxidized Methionine vs. Phenylalanine, since the masses within each pair are very close. The mass comparison in these definitions employs a tolerance, which was set to ± 0.5 Da for fragment ions and ± 2 Da for the parent mass in this paper. The feature $F_5$ measures the presence of peak pairs of singly charged ions corresponding to an amino acid mass difference in the spectrum $S$; the feature $F_6$ measures the presence of peak pairs of doubly charged ions corresponding to an amino acid mass difference in the spectrum $S$, and the feature $F_7$ measures the presence of peak pairs of one doubly charged and one singly charged ion corresponding to an amino acid mass difference in the spectrum $S$. The weighting factors are used in the features to account for the increased likelihood that more intense peaks are true fragment ions.
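The idea behind the singly charged amino acid distance feature can be illustrated with a rough sketch. It is not the paper's exact definition, because the defining formulas are not reproduced in this text. The assumptions are: monoisotopic residue masses (the text does not say whether monoisotopic or average masses were used), unmodified cysteine, a pair weight equal to the product of the two relative intensities (a placeholder for the paper's weighting factor), and no normalization. The names `RESIDUE_MASSES`, `matches_residue`, and `weighted_amino_acid_pairs` are hypothetical.

```python
# Monoisotopic residue masses in Da (an assumption, see lead-in).
# I/L, Q/K and F/sulfoxidized M are merged, giving 17 distinct values.
RESIDUE_MASSES = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L/I": 113.08406,
    "N": 114.04293, "D": 115.02694, "K/Q": 128.09496, "E": 129.04259,
    "H": 137.05891, "F/M(ox)": 147.06841, "R": 156.10111,
    "Y": 163.06333, "W": 186.07931,
}

def matches_residue(delta, tol=0.5):
    """True if a mass difference equals some residue mass within tol Da."""
    return any(abs(delta - mass) <= tol for mass in RESIDUE_MASSES.values())

def weighted_amino_acid_pairs(peaks, tol=0.5):
    """Sum a weight over peak pairs that differ by one amino acid mass.

    peaks: list of (mz, relative_intensity) for singly charged ions.
    The weight (product of relative intensities) is a placeholder assumption.
    """
    ordered = sorted(peaks)
    total = 0.0
    for i, (mz_low, w_low) in enumerate(ordered):
        for mz_high, w_high in ordered[i + 1:]:
            if matches_residue(mz_high - mz_low, tol):
                total += w_low * w_high
    return total

# Made-up peaks: only the first two differ by a residue mass (alanine).
score = weighted_amino_acid_pairs([(200.0, 1.0), (271.04, 0.5), (300.0, 0.2)])
```

Intense peak pairs contribute more to the sum than weak ones, which is the role the text assigns to the weighting factor.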
$F_8$-$F_{10}$: Complements. These features measure how likely it is that an N-terminus ion and a C-terminus ion in the spectrum $S$ are produced as peptide fragments at the same peptide bond. In their definitions, $M_p$ is the mass of the precursor ion of the spectrum $S$. The feature $F_8$ measures the presence of complementary peak pairs of singly charged ions in the spectrum $S$; the feature $F_9$ measures the presence of complementary peak pairs of doubly charged ions in the spectrum $S$, and the feature $F_{10}$ measures the presence of complementary peak pairs of one doubly charged and one singly charged ion in the spectrum $S$.
$F_{11}$-$F_{13}$: Water or ammonia losses. These features measure how likely it is that an ion in the spectrum $S$ is produced by the loss of a water or ammonia molecule from a b-ion or y-ion. In their definitions, $M_w$ and $M_a$ are the masses of a water molecule and an ammonia molecule, respectively. The feature $F_{11}$ measures the presence of peak pairs of singly charged ions with a difference of a water or ammonia molecule in the spectrum $S$; the feature $F_{12}$ measures the presence of peak pairs of doubly charged ions with a difference of a water or ammonia molecule in the spectrum $S$, and the feature $F_{13}$ measures the presence of peak pairs of one doubly charged and one singly charged ion with a difference of a water or ammonia molecule in the spectrum $S$.
$F_{14}$-$F_{16}$: Supportive ions. These features measure how likely it is that an ion in the spectrum $S$ is a supportive ion. This paper considers two kinds of supportive ions: a-ions and z-ions. In their definitions, $M_{CO}$ and $M_{NH}$ are the masses of a CO group and an NH group, respectively. The feature $F_{14}$ measures the presence of peak pairs of singly charged ions with a difference of a CO or NH group in the spectrum $S$; the feature $F_{15}$ measures the presence of peak pairs of doubly charged ions with a difference of a CO or NH group in the spectrum $S$, and the feature $F_{16}$ measures the presence of peak pairs of one doubly charged and one singly charged ion with a difference of a CO or NH group in the spectrum $S$.
The four features $F_i$ ($i = 5, 8, 11, 14$) represent the evidence of the existence of singly charged ions, and the eight features $F_{i+1}$ and $F_{i+2}$ ($i = 5, 8, 11, 14$) represent the evidence of the existence of doubly charged ions.
These twelve features are developed according to the properties of the theoretical spectra proposed in our previous study [16], in which, however, the peak intensities were not considered. The experiments in this study showed that the use of the peak intensities improved the performance of the spectral quality assessment by the SVM method. In general, the high quality spectra are expected to have larger values of these twelve features than the poor quality spectra. In addition, the more intense the peak pairs, the larger the values of these twelve spectral features.
At this point, 16 spectral features have been introduced to describe the spectral quality. It is noted that the larger the number of spectral peaks, the larger the values of the spectral features $F_3$ and $F_5$-$F_{16}$. This is likely to lower the sensitivity of the classifier, because a high quality spectrum with a smaller number of peaks would have smaller values of the spectral features $F_3$ and $F_5$-$F_{16}$. To alleviate these effects, these spectral features are transformed using a small positive constant, which is set to 0.01 in this study. In a spectrum, the m/z range in which doubly charged ion peaks can exist is less than half of its peptide mass. Therefore, when the features $F_6$, $F_{12}$, and $F_{15}$ are computed, the peaks considered are subject to corresponding conditions on their m/z values.

Classification method
In this paper, the support vector machine is applied to assess the spectral quality because of its good generalization ability. The SVM was proposed by Vapnik based on statistical learning theory [17]. An important characteristic of the SVM is that "while most classical neural network algorithms require an ad hoc choice of system's generalization ability, the SVM approach proposes a learning algorithm to control the generalization ability of the system automatically" [18]. The training of an SVM requires the solution of a quadratic programming (QP) optimization problem, which is a large-scale optimization problem. Sequential minimal optimization (SMO) decomposes the overall QP problem into fixed-size QP sub-problems (each involving only two Lagrange multipliers), and these sub-problems are solved analytically [19]. The SMO algorithm is one of the efficient algorithms for solving large QP problems and is employed to train the SVM in this work.
For an input vector $x \in R^n$ ($n = 16$ in this paper), a decision can be made by a well-trained SVM as
$$f(x) = \mathrm{sign}(g(x)), \qquad g(x) = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b,$$
where $x_i \in R^n$ ($i = 1, 2, \ldots, l$) are the support vectors; $\alpha_i$ ($i = 1, 2, \ldots, l$) are the Lagrange multipliers; $y_i \in \{-1, +1\}$ ($i = 1, 2, \ldots, l$) are the class labels of the support vectors $x_i$, with $-1$ for poor quality spectra and $+1$ for high quality spectra in this paper; $K(x_i, x)$ ($i = 1, 2, \ldots, l$) are the kernel functions, and $b$ is the threshold. In this paper, the non-thresholded output $g(x)$ is employed to generate the receiver operating characteristics (ROC) curve by using the algorithm proposed by Fawcett [20]. To evaluate the performance of the SVM classifiers, two correct rates are calculated in this study, the true positive rate (TPR) and the true negative rate (TNR):
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{TNR} = \frac{TN}{TN + FP},$$
where TP is the number of true positives; FP is the number of false positives; TN is the number of true negatives, and FN is the number of false negatives.
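As a minimal sketch of the quantities named above, the following Python code evaluates a kernel SVM output g(x), thresholds it at zero, and computes the TPR and TNR from predicted and reference labels. It assumes the standard kernel expansion g(x) = sum_i alpha_i y_i K(x_i, x) + b and an RBF kernel parameterized as exp(-gamma * ||u - v||^2); how the paper's width parameter maps onto `gamma` is an assumption, and all function names and numbers are hypothetical.

```python
import math

def rbf_kernel(u, v, gamma=0.1):
    """RBF kernel exp(-gamma * ||u - v||^2); the parameterization is assumed."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

def decision_value(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """Non-thresholded SVM output g(x)."""
    return sum(a * y * kernel(sv, x)
               for sv, a, y in zip(support_vectors, alphas, labels)) + b

def classify(x, support_vectors, alphas, labels, b):
    """+1 (high quality) if g(x) >= 0, otherwise -1 (poor quality)."""
    g = decision_value(x, support_vectors, alphas, labels, b)
    return 1 if g >= 0 else -1

def correct_rates(true_labels, predicted_labels):
    """Return (TPR, TNR); assumes both classes occur in true_labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Toy model with two made-up support vectors, one per class.
svs, alphas, ys, b = [(0.0, 0.0), (2.0, 0.0)], [1.0, 1.0], [1, -1], 0.0
near_high = classify((0.0, 0.0), svs, alphas, ys, b)
near_poor = classify((2.0, 0.0), svs, alphas, ys, b)
tpr, tnr = correct_rates([1, 1, -1, -1], [1, -1, -1, -1])
```

Keeping `decision_value` separate from `classify` mirrors the text: the raw output g(x) is what an ROC analysis needs, while the sign gives the class.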

Experimental data
This study used two different proteome datasets: the ISB dataset and the TOV dataset.

ISB dataset
The ISB dataset used in this study was acquired on an ion trap and was provided by the Institute for Systems Biology (ISB, Seattle, USA). The ISB dataset consists of 37043 peptide collision-induced dissociation (CID) mass spectra, which were generated by the tryptic digestion of a control mixture of 18 standard proteins (not of human origin) [4]. These spectra were searched against a human protein database (extracted from [21]) appended with the sequences of the 18 standard proteins and other common contaminants (in total, 5395 protein sequences in the final database) using the SEQUEST search program. The singly charged spectra were excluded from this study because their number (only 504) is too small. The distribution of multiply charged spectra is shown in Table 1. 'H' represents the number of the high quality spectra, and 'P' represents the number of the poor quality spectra. For the ISB dataset, this study has trained three SVM classifiers: one for doubly charged spectra, one for triply charged spectra, and one for multiply charged spectra (not distinguishing between the doubly charged and triply charged ions).

TOV dataset
This dataset consists of 13467 peptide CID mass spectra which were acquired on an LCQ DECA XP ion trap (Thermo Electron Corp.) at the Eastern Quebec Proteomic Center, Laval University Medical Research Center, Canada. The samples analyzed were generated by the tryptic digestion of a whole-cell lysate from the human malignant epithelial ovarian tumor cell-line TOV-112D [22]. These spectra were searched against a subset of the Uniref100 database, release 1.2 [23], including 44278 human protein sequences, using SEQUEST. The assignments of 866 spectra were verified to be correct by PeptideProphet with a cut-off score of 0.8 [4], and these spectra were labeled as "high" quality spectra. The other 12601 spectra in TOV were labeled as poor quality. The distribution of these is shown in Table 2. For the TOV dataset, this study has only trained a classifier for the doubly charged spectra because the number of high quality singly charged and triply charged spectra is too small to train a reasonable classifier.

Results and discussion
Four separate SVM classifiers were trained in this study. The first SVM was trained for the doubly charged spectra; the second one was trained for the triply charged spectra; the third one was trained for the multiply (both doubly and triply) charged spectra. These three classifiers were trained based on the ISB dataset. The fourth one was trained for the doubly charged spectra of the TOV dataset. This study employed the radial basis functions (RBF) whose width parameter was set equal to 0.1 as the kernel functions of these four SVMs. The penalty term for the training set errors was set to 100.
It is noted that the class distribution of the tandem mass spectra is highly imbalanced, i.e., the number of poor quality spectra is much larger than that of high quality spectra. If one randomly chose a certain number of spectra from the dataset to train an SVM, the trained SVM classifier would have a higher TNR and a lower TPR. To get a higher TPR, a larger number of high quality spectra would be required to train the SVM classifier, which would significantly increase the number of training samples and therefore the time needed to train the SVM. To overcome these problems and give equal weight to the high quality and the poor quality spectra in the training of the SVM, this study employed the same number of high quality and poor quality spectra as the training samples. For example, in the ISB dataset only 860 (430 high quality and 430 poor quality) spectra were used to train the SVM classifier for the doubly charged spectra, while 600 (300 high quality and 300 poor quality) spectra were used to train the SVM classifier for triply charged spectra. The number of samples in the training and test sets for all SVM classifiers is shown in Table 3. 'SVM2ISB' stands for the SVM classifier for the doubly charged ISB spectra; 'SVM3ISB' stands for the SVM classifier for the triply charged ISB spectra; 'SVMMISB' stands for the SVM classifier for the multiply charged ISB spectra, and 'SVM2TOV' stands for the SVM classifier for the doubly charged TOV spectra.
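The balanced sampling described above can be sketched as follows. This is an illustrative outline only: the function name `balanced_split` is hypothetical, labels are assumed to be +1 (high quality) and -1 (poor quality), and using all remaining spectra as the test set is an assumption, since the composition of the test sets is given in Table 3, which is not reproduced here.

```python
import random

def balanced_split(labels, n_per_class, seed=None):
    """Pick n_per_class spectra of each class for training.

    labels: list of +1 (high quality) / -1 (poor quality).
    Returns (train_indices, test_indices); the test set is everything
    not chosen for training (an assumption, see lead-in).
    """
    rng = random.Random(seed)
    high = [i for i, y in enumerate(labels) if y == 1]
    poor = [i for i, y in enumerate(labels) if y == -1]
    train = rng.sample(high, n_per_class) + rng.sample(poor, n_per_class)
    chosen = set(train)
    test = [i for i in range(len(labels)) if i not in chosen]
    return sorted(train), test

# Made-up imbalanced dataset: 10 high quality and 90 poor quality spectra.
labels = [1] * 10 + [-1] * 90
train_idx, test_idx = balanced_split(labels, n_per_class=5, seed=1)
```

Calling the function with different seeds gives different random training sets, which is how repeated random sampling of the same dataset could be organized.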
In this study we repeatedly trained and tested each SVM classifier on 20 randomly sampled datasets to investigate the performance of the proposed methods. The results are shown in Tables 4, 5, 6, 7. In these tables, 'Ave.' stands for the average and 'SD' for the standard deviation. For the doubly charged spectra of the ISB dataset, Table 4 shows that the proposed method can eliminate about 89% of the poor quality spectra while losing less than 6% of the high quality spectra in the best case. On average the SVM classifier can eliminate about 90% of the poor quality spectra while losing less than 8% of the high quality spectra. For the triply charged spectra of the ISB dataset, Table 5 shows that the proposed method can remove more than 87% of the poor quality spectra while losing about 4% of the high quality spectra in the best case. On average it can remove about 88% of the poor quality spectra while losing about 7% of the high quality spectra. Table 6 shows that, with the ISB dataset as the classification test, the proposed method can remove over 87% of the poor quality multiply charged spectra (not distinguishing between doubly charged and triply charged ions) while losing less than 8% of the high quality ones in the best case. On average it can remove over 87% of the poor quality multiply charged spectra while losing about 9% of the high quality ones. For the TOV dataset, Table 7 shows that the developed SVM classifier can remove about 85% of the poor quality spectra while losing less than 5% of the high quality ones in the best case. On average about 84% of the poor quality spectra can be removed while losing less than 8% of the high quality spectra with the TOV dataset as the classification test. In summary, the four SVM classifiers developed in this study perform very well.
In addition, a comparison of Table 6 with Tables 4 and 5 indicates that the information about the charge state of the precursor ions can be used to improve the performance of the SVM classifiers. Figures 1 and 2 show the ROC curves for the SVM classifier for the multiply charged ISB dataset and the SVM classifier for the doubly charged TOV dataset, respectively. Even if only a 2% loss of high quality spectra is allowed, the proposed method can filter out about 70% of poor quality multiply charged ISB spectra and 65% of poor quality doubly charged TOV spectra, respectively. Table 8 gives the correct rates of the proposed method and some of the existing methods. From Table 8, it can be seen that the methods proposed earlier by Tabb et al. [8] and Purvine et al. [9] do not perform well, and their TPR values were not reported. When the TPR is fixed at 90% (i.e., the percentage of lost high quality spectra is fixed at 10%), the best results reported for the existing methods in [11] and [14] are a TNR of about 83%. For the ISB dataset, the TNRs in [13] and [14] are 75% and 83%, respectively. However, for the multiply charged spectra of the ISB dataset, on average the proposed method can remove more than 87% of the poor quality spectra while losing about 9% of the high quality ones. This illustrates that the proposed method outperforms these existing methods.
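Statements such as "at most 2% of the high quality spectra are lost" correspond to a single operating point on the ROC curve. The sketch below shows one simple way to find such a point from the non-thresholded SVM outputs: it picks the highest threshold that still keeps the requested fraction of high quality spectra and reports the TNR there. It is not Fawcett's ROC algorithm cited in the paper, and the function name `tnr_at_tpr` and all scores are hypothetical.

```python
import math

def tnr_at_tpr(scores, labels, min_tpr=0.98):
    """Return (threshold, TPR, TNR) for the strictest threshold with TPR >= min_tpr.

    scores: non-thresholded classifier outputs, one per spectrum.
    labels: +1 (high quality) / -1 (poor quality); both classes must occur.
    A spectrum is kept as high quality when its score >= threshold.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    negatives = [s for s, y in zip(scores, labels) if y == -1]
    # Number of high quality spectra that must be kept.
    needed = max(1, math.ceil(min_tpr * len(positives)))
    threshold = positives[needed - 1]
    tpr = sum(1 for s in positives if s >= threshold) / len(positives)
    tnr = sum(1 for s in negatives if s < threshold) / len(negatives)
    return threshold, tpr, tnr

# Made-up scores: four high quality and four poor quality spectra.
scores = [0.9, 0.8, 0.3, 0.1, 0.85, 0.2, 0.0, -0.5]
labels = [1, 1, 1, 1, -1, -1, -1, -1]
threshold, tpr, tnr = tnr_at_tpr(scores, labels, min_tpr=0.5)
```

With these made-up scores, keeping half of the high quality spectra puts the threshold at 0.8 and removes three of the four poor quality spectra.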
In addition, Wong et al. [14] did not report the performance of their proposed method for randomly sampled datasets. Bern et al. [10] reported only the best results from their methods. In [11] and [13], 5-fold cross-validation was employed to test the proposed approaches. However, the results for each test fold were not reported. In [12], 10-fold cross-validation was applied to assess the performance of the proposed method. With the TPR fixed at about 90%, the TNR varied between 43% and 75%, which means that the method is highly sensitive to variations of the training and test sets within the same dataset. In contrast, the last row in each of Tables 4, 5, 6, 7 gives the standard deviation of our proposed methods over twenty randomly sampled datasets. All the standard deviations for TPR and TNR are very small (from 0.65% to 1.67%). This indicates that the proposed method is insensitive to variations of the training and test sets within the same dataset.
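The 'Ave.' and 'SD' summaries over repeated runs can be computed as in the short sketch below. The rates used here are made up for illustration and are not values from the paper's tables; whether the paper used the sample or the population standard deviation is not stated, so the use of the sample standard deviation is an assumption, as is the function name `summarize`.

```python
import statistics

def summarize(rates):
    """Return (average, sample standard deviation) of rates from repeated runs."""
    return statistics.mean(rates), statistics.stdev(rates)

# Made-up TNR values from three hypothetical runs (illustration only).
average, sd = summarize([0.90, 0.92, 0.94])
```

A small standard deviation relative to the average indicates that the classifier's performance changes little from one random training/test split to another.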
The SVM classifiers are trained with the results of SEQUEST, which may themselves contain false positives or false negatives. To further investigate the bias arising from the SEQUEST results, we have examined the spectra in the false positive set and the false negative set, respectively. For this purpose, we randomly selected 100 false positive spectra (50 doubly charged and 50 triply charged) from the ISB dataset, which were classified as high quality by the proposed method, yet were unidentified by the SEQUEST search program. These 100 spectra were re-searched by on-line MASCOT [24] against the SwissProt database. The parent mass tolerance was set at ± 2 Da, and the fragment ion mass tolerance was set at ± 0.5 Da. The enzyme parameter was set to tryptic sequences, and the maximum number of missed cleavage sites was 1. We found that 14 doubly charged and 8 triply charged spectra had peptide-spectrum matching scores over the cut-off for peptides with significant homology. This indicates that 22% of the false positives may be true positives, and thus the TNR of our method should be higher. In addition, 15 of these 22 spectra are interpreted as the same peptides by SEQUEST (below the cut-off score) and MASCOT (above the cut-off score).
We also randomly selected 10 false negative spectra, which have high matching scores from the SEQUEST method, yet are classified as poor quality by the proposed method. These 10 spectra were also re-searched by on-line MASCOT [24] against the SwissProt database. We found that the MASCOT ion scores of all these ten spectra were less than 15. This indicates that these 10 spectra may be true negatives. To confirm this indication, manual verification by a mass spectrometry expert is required. Figure 3 shows one of these spectra, which was interpreted as a doubly charged ion with peptide 'VAGTWYSLAMAASDISLLDAQSAPLR' by SEQUEST with a high Xcorr score of 3.4848. However, its MASCOT ion score is as low as 2. From Figure 3, it is obvious that this spectrum is poor.
Figure 1. ROC curve for the SVM classifier for multiply charged ISB spectra. Even if only a 2% loss of high quality multiply charged spectra is allowed, the proposed method can filter out about 70% of the poor quality ones.

Conclusion
In this paper, an SVM-based method is proposed for assessing the quality of tandem mass spectra from ion trap mass spectrometers. Sixteen spectral features are introduced to describe the quality of peptide mass spectra. Each spectrum is mapped into a 16-dimensional feature vector. The SVM is applied to construct a classifier in the feature space that distinguishes high quality from poor quality peptide mass spectra. Four separate SVM classifiers are trained and tested on two different mass spectral datasets: the ISB and TOV datasets. Computational experimental results have demonstrated the effectiveness of the proposed method and indicated that the proposed method outperforms the existing methods.
The significance of the proposed method is three-fold. First, the proposed method provides a reliable evaluation of the spectral quality. Therefore, the poor quality spectra can be filtered out before the database search, which significantly reduces the computational time spent on spectral searching. Second, the proposed method can be employed to evaluate the database search results from one search engine in combination with different identification methods. For example, by both re-searching false negative spectra with MASCOT and manual verification, we can confirm that the assignments of these spectra by SEQUEST are actually false. Third, the proposed method can be used to identify more significant peptide-spectrum assignments. For example, in this study, when the false positive spectra (which are determined to be of high quality by the proposed method) were searched with MASCOT, about 22% of these spectra were identified.
Although neither the database searches nor our proposed method took post-translationally modified amino acids into account in this study, except for sulfoxidized Methionine, our method can still be beneficial in identifying post-translationally modified peptides/proteins. Each protein has only a few modified amino acids, and each peptide has even fewer. The values of our proposed features may therefore be affected little by the few modified amino acids. Because of the robustness of our method, spectra with modified amino acids are likely to be determined as high quality. Note that the number of high quality spectra determined by the proposed method is much smaller than the number of spectra in the original dataset. Therefore, it would save a significant amount of time to find post-translationally modified peptides/proteins by searching only those high quality spectra.
Figure 3. A false negative spectrum from the ISB dataset.