A Dynamic Noise Level Algorithm for Spectral Screening of Peptide MS/MS Spectra
© Xu and Michael; licensee BioMed Central Ltd. 2010
Received: 11 May 2010
Accepted: 23 August 2010
Published: 23 August 2010
High-throughput shotgun proteomics data contain a significant number of spectra from non-peptide ions or spectra of too poor quality to obtain highly confident peptide identifications. These spectra cannot be identified with any positive peptide matches in some database search programs or are identified with false positives in others. Removing these spectra can improve the database search results and lower computational expense.
A new algorithm has been developed to filter tandem mass spectra of poor quality from shotgun proteomic experiments. The algorithm determines the noise level dynamically and independently for each spectrum in a tandem mass spectrometric data set. Spectra are filtered based on a minimum number of required signal peaks with a signal-to-noise ratio of 2. The algorithm was tested with 23 sample data sets containing 62,117 total spectra.
The spectral screening removed 89.0% of the tandem mass spectra that did not yield a peptide match when searched with the MassMatrix database search software. Only 6.0% of tandem mass spectra that yielded peptide matches considered to be true positive matches were lost after spectral screening. The algorithm was found to be very effective at removal of unidentified spectra in other database search programs including Mascot, OMSSA, and X!Tandem (75.93%-91.00%) with a small loss (3.59%-9.40%) of true positive matches.
Shotgun proteomics has gained increasing interest and become one of the most widely used tools in mass spectrometry (MS) based proteomics [1, 2]. A large amount of data can be generated in high-throughput shotgun proteomics experiments. The analysis of these data presents many challenges. For example, high-throughput shotgun proteomics data contain a significant number of spectra from non-peptide ions or spectra of too poor quality to obtain highly confident peptide identifications. These spectra cannot be identified with any positive peptide matches in some database search programs or are identified with false positives in others. Furthermore, these spectra consume data analysis time when searching the data set. Therefore, removing these spectra can improve the database search results and lower computational expense.
There have been many reports of algorithms used to filter poor quality tandem mass spectra. Moore et al. developed an empirical model to assess the quality of tandem mass spectra prior to database search [3]. An apparent disadvantage of this model was that different selection criteria and different empirical parameters were needed for different mass spectrometers. Bern et al. developed another algorithm to predict the quality of tandem mass spectra before database search. Their algorithm was able to filter 75% of the unidentified spectra of poor quality while keeping 90% of the identified spectra [4]. Wong et al. reported an approach to assess spectral quality based on logistic regression using various spectral features [5]. This approach can be used to assess spectral quality and to filter poor quality spectra prior to database search. Purvine et al. developed a spectral quality assessment method to filter tandem MS data prior to database search based on three features of a spectrum: 1) charge state differentiation, 2) total signal intensity, and 3) signal-to-noise estimates [6]. The noise in a spectrum was approximated to be the average intensity of the lower half of the peaks in the spectrum. This estimate can be heavily biased when too few or too many signal peaks of high abundance exist in the spectrum. Flikka et al. used a machine learning approach to differentiate poor quality spectra from good ones using various spectral features of a tandem MS spectrum, including the number of peaks, peak abundances and their standard deviation, precursor charge state, and average m/z value [7]. This method filtered up to 62% of unidentified spectra and was less efficient in filtering poor quality spectra than other methods.
Here we report a new dynamic noise level (DNL) algorithm, which is capable of filtering spectra of poor quality. The algorithm determines a noise level for each spectrum in a tandem MS data set by testing peaks from the lowest to highest. Based on that noise level, the algorithm then determines the number of signal peaks for the spectrum and its resulting quality. Poor quality spectra are excluded from further analysis. The algorithm was tested on a large tandem MS data set containing 62,117 spectra. Overall, the filtering achieved a significant reduction in false positives and unidentified spectra resulting in shorter database search times.
Algorithm and Implementation
Dynamic Noise Level (DNL) Algorithm
- 1. For peak k being scanned, the previous k-1 peaks have been determined to be noise by the previous scans. If k equals 2, the abundance of the second peak predicted by the first noise peak is calculated by the following equation, (1)
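- 2. If k is greater than 2, a linear regression model is fitted to the abundances of the previous k-1 sorted noise peaks,
$$\mathbf{I} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \qquad (2)$$
where the matrices are $\mathbf{I} = (I_1, I_2, \ldots, I_{k-1})^{T}$ and $\mathbf{X} = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ \vdots & \vdots \\ 1 & k-1 \end{pmatrix}$; the least-squares estimate of the regression coefficients is
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{I} \qquad (3)$$
and the predicted abundance of peak k is
$$\hat{I}_k = \hat{\beta}_0 + \hat{\beta}_1 k \qquad (4)$$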
- 3. The signal-to-noise ratio (SNR) of peak k is estimated by the ratio of the observed peak abundance $I_k$ to the peak abundance $\hat{I}_k$ predicted by assuming it is noise, i.e.
$$\mathrm{SNR}_k = \frac{I_k}{\hat{I}_k} \qquad (5)$$
where $\hat{I}_k$ is calculated from equation 1 when k equals 2 and equation 4 when k is greater than 2. If the estimated SNR is greater than the threshold SNRmin, peak k is determined to be signal and $\hat{I}_k$ is defined to be the noise level of this spectrum. Otherwise, it is noise and the scanning continues.
As a general rule of thumb, the minimum SNR for signal peaks in a tandem mass spectrum, SNRmin, was set to be 2. The SNR threshold can be adjusted for more or less aggressive filtering as desired.
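The scan described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' C++ implementation: the function names are hypothetical, the regression uses the peak rank (1, 2, ..., k-1) as the predictor, and the k = 2 prediction `(1 + delta) * I_1` is only a guessed stand-in for equation 1, whose exact form is not shown in the text.

```python
def predict_next_noise(noise):
    """Fit abundance vs. rank by ordinary least squares on the sorted noise
    peaks (ranks 1..m) and extrapolate to rank m + 1. Requires m >= 2."""
    m = len(noise)
    ranks = range(1, m + 1)
    x_mean = (m + 1) / 2.0
    y_mean = sum(noise) / m
    sxx = sum((x - x_mean) ** 2 for x in ranks)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(ranks, noise))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept + slope * (m + 1)


def find_first_signal_peak(abundances, snr_min=2.0, delta=0.5):
    """Scan peaks from lowest to highest abundance.

    Returns (k, predicted_noise) for the first peak whose estimated SNR
    exceeds snr_min, where k is the 1-based rank of that peak, or
    (None, None) if every peak is classified as noise.
    """
    # Only positive abundances are considered, so predictions stay positive.
    peaks = sorted(a for a in abundances if a > 0)
    for k in range(2, len(peaks) + 1):
        observed = peaks[k - 1]
        if k == 2:
            # ASSUMPTION: hypothetical form of equation 1 (not given in the text).
            predicted = (1.0 + delta) * peaks[0]
        else:
            predicted = predict_next_noise(peaks[:k - 1])
        if observed / predicted > snr_min:
            return k, predicted
    return None, None
```

For a spectrum whose ten lowest peaks rise linearly from 1 to 10 and whose next peak has abundance 50, the scan stops at the eleventh peak, because the regression predicts a noise abundance of about 11 and 50/11 exceeds the default threshold of 2.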
The DNL algorithm can be used to screen tandem mass spectra of poor quality prior to database search. Based on assumption 1, all peaks with abundances greater than or equal to the first signal peak will be considered as signal. If the number of signal peaks, n, of the spectrum is below the threshold nmin, the spectrum will be filtered. As a rule of thumb, the minimum number of signal peaks for a spectrum of good quality, nmin, was set to be 8. The parameter nmin can also be adjusted to allow for more or less aggressive filtering.
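The screening decision itself is a simple count, which the following Python sketch illustrates. The function names are hypothetical, and `first_signal_abundance` is assumed to be the abundance of the first signal peak found by the noise-level scan. Treating a spectrum with no signal peak at all as having n = 0 (and therefore filtering it) is an assumption of this sketch.

```python
def count_signal_peaks(abundances, first_signal_abundance):
    """Count peaks at least as abundant as the first signal peak."""
    return sum(1 for a in abundances if a >= first_signal_abundance)


def keep_spectrum(abundances, first_signal_abundance, n_min=8):
    """Return True if the spectrum has at least n_min signal peaks."""
    if first_signal_abundance is None:
        # ASSUMPTION: no signal peak was found, so n = 0 and the spectrum is filtered.
        return False
    return count_signal_peaks(abundances, first_signal_abundance) >= n_min
```

Raising `n_min` makes the filter more aggressive, and lowering it makes the filter more permissive.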
The DNL spectral screening algorithm described herein was implemented in a standalone program written in C++. The Windows version of the program is freely available at http://www.massmatrix.net/download/. The algorithm is also incorporated in the MassMatrix database search engine for on-the-fly spectral screening during database search.
Sample Preparation and Mass Spectrometry
Bovine histones were isolated from bovine thymus tissue as described by Sures et al. [8, 9]. The mixture of bovine histones was digested by trypsin in 100 mM ammonium bicarbonate buffer (pH = 8.0). Enzymes were used in a 25:1 ratio (substrate:enzyme) and the mixture was incubated at 37°C for two hours. The digested peptides were identified using data-dependent nano-LC-MS/MS on an LCQ Deca XP ion trap mass spectrometer (ThermoFisher, San Jose, CA, USA). 2.0 μL of bovine histone peptides with a total concentration of 0.1 μg/μL was injected into the LC-MS/MS system and eluted off the capillary HPLC column into the LCQ mass spectrometer using a linear gradient. Solvent A was water with 0.1% acetic acid and solvent B was acetonitrile with 0.1% acetic acid. Ions were fragmented by use of collision induced dissociation.
Database Search and Search Parameters
The RAW data files collected on the mass spectrometer were converted to MGF files and merged into a single large MGF file by use of MassMatrix data conversion tools (http://www.massmatrix.net/download). The merged MGF file contained 62,117 tandem mass spectra. Tandem mass spectra that were not derived from singly charged precursor ions were searched as both doubly and triply charged precursors. Therefore, some spectra were searched with both +2 and +3 charges. This resulted in 86,147 tandem mass spectra in the data set to be processed and searched. The data set was filtered by the dynamic noise level algorithm. Both the original and filtered data sets were then searched by use of MassMatrix [10–12] (version 1.0.0, http://www.massmatrix.net) against a protein database containing both the bovine histone database (117 proteins) and a decoy reversed human database (96,997 proteins) using the following options: i) No variable or fixed modifications; ii) Enzyme: trypsin; iii) Missed Cleavages: 3; iv) Peptide Length: 6 to 30 amino acid residues; and v) Mass tolerances of 2.0 Da and 0.8 Da for the precursor and product ions respectively. For each spectrum, the highest scored peptide match was assumed to be the best peptide hit.
The data sets were also evaluated by use of Mascot [13], OMSSA [14], and X!Tandem [15]. The counterpart search parameters in Mascot, OMSSA, and X!Tandem were identical to those in MassMatrix. For X!Tandem searches, refinement was enabled and performed for the peptide matches with expectation values greater than or equal to 1.0.
Results and Discussion
The performance of the DNL algorithm was tested with a merged data set consisting of 23 shotgun proteomic experiments on bovine histone tryptic digests containing 62,117 total tandem mass spectra. The quality of the tandem mass spectra in the data set was evaluated by database searches in MassMatrix against a database containing a bovine histone database and a decoy reversed human database to eliminate the bias of manual evaluation. The decoy reversed human database created ~1000 times as many theoretical peptides as the bovine histone database. Therefore, false positive peptide matches from the bovine histone database were assumed to be negligible. The peptide matches returned from the bovine histone database were considered to be true positives (TPs), while those from the decoy database were, therefore, considered to be false positives (FPs) [16]. The tandem mass spectra in the data set were classified into three categories: spectra identified with TPs, spectra identified with FPs, and spectra with no significant matches (unidentified spectra).
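The three-way classification can be illustrated with a short Python sketch. The input encoding is hypothetical: each spectrum is represented by the source of its best significant hit, `"target"` for the bovine histone database, `"decoy"` for the reversed human database, or `None` when no significant match was returned.

```python
from collections import Counter


def classify_spectrum(best_hit_source):
    """Map the source database of the best hit to TP, FP, or unidentified."""
    if best_hit_source is None:
        return "unidentified"
    if best_hit_source == "target":
        return "TP"
    if best_hit_source == "decoy":
        return "FP"
    raise ValueError("unknown hit source: %r" % (best_hit_source,))


def summarize(best_hit_sources):
    """Count how many spectra fall into each of the three categories."""
    return Counter(classify_spectrum(s) for s in best_hit_sources)
```

Because the decoy database is far larger than the target database, a random match is much more likely to land in the decoy, which is why target hits are treated as true positives here.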
The effect of δ on spectral screening is negligible because the majority of tandem MS spectra for peptides contain more than two noise peaks and the lowest noise peaks are not extreme compared to the higher noise peaks. For the merged data set containing 62,117 spectra from 23 experiments, the extreme setting of δ equal to 0 resulted in < 1% loss in sensitivity, i.e. the success rate of filtering bad spectra. The extreme setting of δ > 1.0 resulted in < 0.01% loss in specificity, i.e. the success rate of keeping good spectra. Therefore, a fixed intermediate setting of δ = 0.5 is used in the current implementation of the DNL algorithm.
Area under the curve (AUC) for the ROC curves indicates the overall discrimination power of the DNL spectral screening algorithm.
The ROC curve of SNR equal to 2 in Figure 3 is the sensitivity vs (1 - specificity) plot at various threshold settings of nmin, i.e. the number of signal peaks. For singly charged spectra, a threshold nmin equal to 9 has a specificity of 94.72% (i.e. false rate 5.28%) and a sensitivity of 85.67%. For doubly/triply charged spectra, a threshold nmin equal to 7 has a specificity of 95.31% (i.e. false rate 4.69%) and a sensitivity of 73.69%. For all spectra, an overall threshold nmin equal to 8 achieved a specificity of 94.06% (i.e. false rate 5.94%) and a sensitivity of 80.07%. Applying different optimal nmin thresholds for spectra with different charges provided very limited improvements with regard to sensitivities and specificities. Therefore, the current implementation of the DNL algorithm does not support applying different settings for spectra with different charges. A setting of nmin equal to 8 will be used for all spectra in the discussion herein.
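To make the ROC construction concrete, the following Python sketch computes sensitivity and specificity at a given nmin and the area under the resulting curve. It is illustrative only: the function names are hypothetical, the inputs are assumed to be lists of signal-peak counts for good and bad spectra (how spectra are assigned to the two groups is left to the caller), and the AUC uses the trapezoidal rule, which the text does not specify.

```python
def sensitivity_specificity(good_counts, bad_counts, n_min):
    """Sensitivity = fraction of bad spectra filtered (n < n_min);
    specificity = fraction of good spectra kept (n >= n_min)."""
    filtered_bad = sum(1 for n in bad_counts if n < n_min)
    kept_good = sum(1 for n in good_counts if n >= n_min)
    return filtered_bad / len(bad_counts), kept_good / len(good_counts)


def roc_points(good_counts, bad_counts):
    """Return (1 - specificity, sensitivity) pairs for every n_min threshold."""
    max_n = max(list(good_counts) + list(bad_counts))
    points = []
    for n_min in range(0, max_n + 2):
        sens, spec = sensitivity_specificity(good_counts, bad_counts, n_min)
        points.append((1.0 - spec, sens))
    return sorted(points)


def area_under_curve(points):
    """Trapezoidal area under a curve given as sorted (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A threshold of 0 keeps every spectrum and gives the point (0, 0), and a threshold above the largest peak count filters every spectrum and gives the point (1, 1), so the curve always spans the full range.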
The robustness of the DNL algorithm over different experiments was evaluated by the ROC analysis of the 23 individual tandem MS data sets as provided in the additional file [see Additional file 1]. The DNL algorithm achieved good overall power to discriminate good quality spectra from poor quality ones for all the data sets used.
A new dynamic noise level (DNL) algorithm has been developed to remove tandem mass spectra of poor quality. The algorithm was evaluated with a large data set that contained 62,117 spectra and was searched by MassMatrix against a database containing true protein sequences and a large decoy database. The algorithm determined the noise level dynamically and independently for each spectrum in tandem MS data. The distribution of noise in the spectra from the large test data set showed that the noise levels for tandem mass spectra varied significantly from one to another for ion trap mass spectrometry data. The algorithm assessed the quality of spectra based on the number of signal peaks and filtered those with fewer than 8 signal peaks. It was found that 89.0% of unidentified spectra in the MassMatrix database search program were successfully filtered while only 6.0% of spectra with true positive matches were removed upon DNL spectral screening. The algorithm was also found to be very effective at removal of unidentified spectra (75.93%-91.00%) in other database search programs including Mascot, OMSSA, and X!Tandem, with a small loss (3.59%-9.40%) of true positive matches.
Availability and Requirements
Project name: Dynamic Noise Level Algorithm.
Project home page: http://www.massmatrix.net/.
Operating systems: Windows.
Programming language: ANSI C++.
Other requirements: None.
Any restrictions to use by non-academics: None.
The authors thank Mitchell Meade and Lanhao Yang for providing the data sets. The study was funded by The Ohio State University, the National Institutes of Health (CA107106, CA101956), the V Foundation (AACR Translational Cancer Research Grant), the Leukemia & Lymphoma Society (SCOR), the University of Illinois at Chicago, and the Searle Funds at the Chicago Community Trust to the Chicago Biomedical Consortium.
- Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature 2003, 422: 198–207. doi:10.1038/nature01511
- Sadygov RG, Cociorva DC, Yates JR: Large-scale database searching using tandem mass spectra: Looking up the answer in the back of the book. Nature Methods 2004, 1(3):195–202. doi:10.1038/nmeth725
- Moore RE, Young MK, Lee TD: Method for screening peptide fragment ion mass spectra prior to database searching. J Am Soc Mass Spectrom 2000, 11: 422–426. doi:10.1016/S1044-0305(00)00097-0
- Bern M, Goldberg D, McDonald WH, Yates JR: Automatic quality assessment of peptide tandem mass spectra. Bioinformatics 2004, S20(1):i49-i54. doi:10.1093/bioinformatics/bth947
- Wong JWH, Sullivan MJ, Cartwright HM, Cagney G: msmsEval: tandem mass spectral quality assignment for high-throughput proteomics. BMC Bioinformatics 2007, 8: 51. doi:10.1186/1471-2105-8-51
- Purvine S, Kolker N, Kolker E: Spectral quality assessment for high-throughput tandem mass spectrometry proteomics. OMICS 2004, 8(3):255–265. doi:10.1089/omi.2004.8.255
- Flikka K, Martens L, Vandekerckhove J, Gevaert K, Eidhammer I: Improving the reliability and throughput of mass spectrometry-based proteomics by spectrum quality filtering. Proteomics 2006, 6: 2086–2094. doi:10.1002/pmic.200500309
- Sures I, Gallwitz D: Histone-specific acetyltransferases from calf thymus. Isolation, properties, and substrate specificity of three different enzymes. Biochem 1980, 19: 943–951. doi:10.1021/bi00546a019
- Zhang LW, Freitas MA, Wickham J, Parthun MR, Klisovic MI, Marcucci G, Byrd JC: Differential expression of histone post-translational modifications in acute myeloid and chronic lymphocytic leukemia determined by high-pressure liquid chromatography and mass spectrometry. J Am Soc Mass Spectrom 2004, 15: 77–86. doi:10.1016/j.jasms.2003.10.001
- Xu H, Freitas MA: A Mass Accuracy Sensitive Probability Based Scoring Algorithm for Database Searching of Tandem Mass Spectrometry Data. BMC Bioinformatics 2007, 8: 133. doi:10.1186/1471-2105-8-133
- Xu H, Freitas MA: Monte Carlo simulation based algorithms for analysis of shotgun proteomic data. J Proteome Res 2008, 7(7):2605–2615. doi:10.1021/pr800002u
- Xu H, Freitas MA: MassMatrix: A database search program for rapid characterization of proteins and peptides from tandem mass spectrometry data. Proteomics 2009, 9(6):1548–1555. doi:10.1002/pmic.200700322
- Perkins DN, Pappin DJC, Creasy DM, Cottrell JS: Probability-based protein identification by searching sequence database using mass spectrometry data. Electrophoresis 1999, 20: 3551–3567. doi:10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2
- Geer LY, Markey SP, Kowalak JA, Wagner L, Xu M, Maynard DM, Yang X, Shi W, Bryant SH: Open mass spectrometry search algorithm. J Proteome Res 2004, 3: 958–964. doi:10.1021/pr0499491
- Craig R, Cortens JP, Beavis RC: Open source system for analyzing, validating, and storing protein identification data. J Proteome Res 2004, 3(6):1234–1242. doi:10.1021/pr049882h
- Elias JE, Gygi SP: Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature Methods 2007, 4(3):207–214. doi:10.1038/nmeth1019
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.