Tandem mass spectrometry data quality assessment by self-convolution
© Choo and Tham; licensee BioMed Central Ltd. 2007
Received: 16 October 2006
Accepted: 20 September 2007
Published: 20 September 2007
Many algorithms have been developed for deciphering tandem mass spectrometry (MS) data sets. They fall essentially into two classes. The first performs searches against a theoretical mass spectrum database, while the second is based on de novo sequencing from raw mass spectrometry data. The quality of the mass spectra significantly affects the protein identification process in both cases. This prompted the authors to explore ways to measure the quality of MS data sets before subjecting them to protein identification algorithms, thus allowing for more meaningful searches and increased confidence in the proteins identified.
The proposed method measures the quality of MS data sets based on the symmetric property of the b- and y-ion peaks present in an MS spectrum. Self-convolution of the MS data with its time-reversed copy was employed. Owing to the symmetric nature of the b-ion and y-ion peaks, the self-convolution of a good spectrum produces its highest intensity peak at the mid-point. To reduce processing time, self-convolution was computed using the Fast Fourier Transform and its inverse, followed by removal of the "DC" (Direct Current) component and normalisation of the data set. The quality score was defined as the ratio of the intensity at the mid-point to the remaining peaks of the convolution result. The method was validated using both theoretical mass spectra, with various permutations, and several real MS data sets. The results were encouraging, revealing a high percentage of positive prediction rates for spectra with good quality scores.
We have demonstrated in this work a method for determining the quality of tandem MS data sets. By determining the quality of tandem MS data before subjecting them to protein identification algorithms, spurious protein predictions due to poor tandem MS data are avoided, giving scientists greater confidence in the predicted results. We conclude that the algorithm performs well and could potentially be used as a pre-processing step for all mass spectrometry based protein identification tools.
Mass spectrometry (MS) is a common analytical technique used to identify unknown compounds, quantify known materials, and elucidate the molecular structure and chemical composition of organic and inorganic substances. A mass spectrometer is an instrument used to measure the mass-to-charge ratio of individual molecules that have been converted into electrically charged molecules, or ions. These ions are filtered and ordered from lower to higher mass-to-charge ratio (m/z) before passing through an ion detector in the instrument. In proteomic analysis, matrix assisted laser desorption ionisation (MALDI) and electrospray ionisation (ESI) are the two ionisation techniques generally used. Mass spectrometry is currently experiencing rapid growth in mass-spectrometry-based biomarker discovery and clinical proteomics, where hundreds of proteins can be sequenced quickly. As a consequence, large amounts of proteomics data are produced and made available to the public [3–5].
Although the generation of raw MS spectra has become easier, the analysis and identification of the data still pose many challenges. Many protein identification tools have been developed, such as PEAKS, MASCOT [7, 8], Phenyx, SEQUEST and OMSSA. High throughput proteomics involves the analysis of hundreds of thousands of peptide spectra derived from biological samples. Four general types of algorithms can identify these spectra:
Probability-based matching that calculates a score based on the statistical significance of a match between an observed peptide fragment and those calculated from a sequence search library [7, 19–22].
Cross-correlation methods and probability-based matching are two well-received methods for protein identification. In these methods, a theoretical mass spectra database is first generated from known protein sequences. To search this database with experimental spectra, the correlation of the experimental and theoretical spectra is calculated. Based on the statistical properties of the protein database and the correlation values (actual implementation is more complex), a score is given for the matched spectra.
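As a rough illustration of the matching idea described above, the following Python sketch bins two peak lists and scores their overlap with a normalised dot product. It is a deliberate simplification and not the scoring used by any of the tools named in this paper; the function names `bin_spectrum` and `cosine_similarity`, the 1 Da bin width and the example peak lists are all assumptions made for the illustration.

```python
import math


def bin_spectrum(mz, intensity, width=1.0):
    """Sum peak intensities into fixed-width m/z bins (bin width is an assumption)."""
    bins = {}
    for m, x in zip(mz, intensity):
        key = int(m // width)
        bins[key] = bins.get(key, 0.0) + x
    return bins


def cosine_similarity(a, b):
    """Normalised dot product of two binned spectra; 0.0 if either is empty."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical peak lists with unit intensities: two of three peaks coincide.
theoretical = bin_spectrum([175.12, 361.20, 489.26], [1.0, 1.0, 1.0])
observed = bin_spectrum([175.10, 361.20, 600.00], [1.0, 1.0, 1.0])
score = cosine_similarity(theoretical, observed)
print(round(score, 3))
```

A real search engine would add statistical modelling of the database on top of such a raw similarity, as the text notes.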
Most of these tools have attained a certain degree of success thus far; nevertheless, reliable protein identification using these methods is still a time-consuming and program-dependent task. A considerable frequency of false positive protein identifications has been reported in independent studies [23, 24]. Since the quality of mass spectra is crucial in protein identification, several attempts have been made to address the issue using information obtained from mass spectra generated by fragmented peptides [25–28]. In particular, Purvine et al. used a prefilter with three features for tandem MS spectra classification: one feature addressed the uncertainty in charge state assignments, the second was based on total signal intensity and the third on a signal-to-noise estimate. They obtained good results by adjusting these features. Although these approaches have been useful, we introduce an additional prefilter feature based on the symmetry property of the b- and y-ions, to complement and improve the prefilter process.
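The convolution of two discrete data series is defined as
$$(f * g)_i = \sum_j f_j \, g_{i-j}$$
where $f_j$ and $g_j$ are two time series data sets. Self-convolution refers to convolution applied to the same data series, where $g_{i-j}$ is the time-reversed copy of the data series $f_j$.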
Self-convolution has been used in many applications where symmetry is a key feature of the signal, such as those found in digital communication and image processing. We will show in this work that MS spectra do have such a property, inherited naturally from the fragmentation process, and hence the same approach can be used to extract information from the spectra. The success of this method depends on the availability of the complementary b- and y-ions, which are the two most commonly found ion types in conventional tandem mass spectrometry.
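To make the role of symmetry concrete, here is a minimal Python sketch of a direct (non-FFT) self-convolution applied to a toy signal whose peaks mirror each other about its centre. The function name `self_convolve` and the toy data are assumptions for illustration only; they are not taken from the authors' implementation.

```python
def self_convolve(f):
    """Full linear convolution of the sequence f with itself.

    out[i] = sum over j of f[j] * f[i - j], for i = 0 .. 2 * len(f) - 2.
    """
    n = len(f)
    out = [0.0] * (2 * n - 1)
    for j, fj in enumerate(f):
        if fj == 0:
            continue  # skip empty positions, they contribute nothing
        for k, fk in enumerate(f):
            out[j + k] += fj * fk
    return out


# Toy "spectrum": peaks at positions 1, 3, 6 and 8 mirror each other,
# so every peak has a partner and each pair of positions sums to 9.
toy = [0, 1, 0, 1, 0, 0, 1, 0, 1, 0]
conv = self_convolve(toy)
peak_index = max(range(len(conv)), key=conv.__getitem__)
print(peak_index, len(conv) // 2)
```

In this toy case the largest value of the self-convolution sits at the centre of the output, because that is the only shift at which all mirrored pairs line up at once.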
Peptide fragmentation is a process where peptide fragment ions are generated by dissociation in an ion trap of a mass spectrometer. In this process, the breakage can occur between any bonds in the peptide, but commonly occurs at the peptide bond. When a peptide is fragmented at a single peptide bond between the carbonyl and nitrogen, two fragments are formed. In the case where one peptide fragment retains the positive charge at the C-terminus of the peptide ion, it is called a y-ion. If the fragment retains the positive charge at the N-terminus, it is known as a b-ion. When a singly charged peptide is fragmented, the charge is retained only at one terminus and only the fragment containing the charge is detected while the other fragment is lost as a neutral fragment. Doubly charged peptides tend to produce two singly charged ions, though sometimes doubly charged ions can also be formed.
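The b- and y-ion series described above can be sketched in Python as running sums of residue masses. The residue masses, proton mass and water mass below are standard monoisotopic values rounded to five decimals and are not taken from this paper; the function names `b_ions` and `y_ions` are hypothetical, and only singly charged fragments are considered.

```python
# Monoisotopic residue masses (Da) for the amino acids in the example peptide.
RESIDUE_MASS = {
    "A": 71.03711, "D": 115.02694, "E": 129.04259, "I": 113.08406,
    "L": 113.08406, "M": 131.04049, "Q": 128.05858, "R": 156.10111,
    "T": 101.04768, "W": 186.07931,
}
PROTON = 1.00728   # mass of the charge-carrying proton
WATER = 18.01056   # mass of H2O retained by C-terminal fragments


def b_ions(peptide):
    """Singly charged b-ion m/z values b1 .. b(n-1) (N-terminal fragments)."""
    masses, total = [], 0.0
    for residue in peptide[:-1]:
        total += RESIDUE_MASS[residue]
        masses.append(total + PROTON)
    return masses


def y_ions(peptide):
    """Singly charged y-ion m/z values y1 .. y(n-1) (C-terminal fragments)."""
    masses, total = [], 0.0
    for residue in reversed(peptide[1:]):
        total += RESIDUE_MASS[residue]
        masses.append(total + WATER + PROTON)
    return masses


peptide = "MTDQEAIQDLWQWR"   # example peptide analysed in this work
b = b_ions(peptide)
y = y_ions(peptide)

# Complementary fragments come from the same cleavage site: b1 with y13,
# b2 with y12, and so on. Their m/z values add up to a constant.
pair_sums = [bi + yi for bi, yi in zip(b, reversed(y))]
print(round(b[1], 2), round(y[0], 2), round(pair_sums[0], 2))
```

Because each complementary pair covers the whole peptide exactly once, every pair sum equals the same value, which is the symmetry the quality score relies on.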
The development of the chemical theory of peptide fragmentation [39, 40] has enabled the de novo prediction of fragmentation spectra from peptide sequences. Using a kinetic model, Zhang made the first successful attempt at predicting the low-energy CID spectra of singly and doubly charged peptides. Elias et al. were the first to successfully utilise a set of well-annotated fragmentation spectra acquired from an electrospray ion-trap mass spectrometer to infer the probabilistic rules of fragmentation. More recently, Arnold et al. used a machine-learning algorithm to predict various fragment-ion types of doubly and triply charged precursor ions by learning peptide fragmentation rules in the form of posterior probabilities. Yu et al. proposed a novel method to automatically learn the factors influencing fragmentation from a training set of tandem MS spectra. Despite the availability of these prediction models, it is unclear how they could be used for predicting fragment ions in different types of mass spectrometers.
Table 1. Scoring of theoretical mass spectrum under different conditions
Protein sequence: MTDQEAIQDLWQWR
Reported values: mid-point peak value; average of 20 peaks
Test Section A (white Gaussian noise level): 0%, 5%, 10%, 15%, 20%, 25%, 30%
Test Section B (random peaks added, noise level 1): 10, 20
Test Section C (b-ion peaks reduced, noise level 1): 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%
Test Section D (y-ion peaks reduced, noise level 1): 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%
Test Section E (b-ion peaks removed, noise level 1): 2, 4, 6, 8, 10
Test Section F (y-ion peaks removed, noise level 1): 2, 4, 6, 8, 10
Quantitative measurement of theoretical tandem MS spectra
We first compute the quality score (QS) on theoretical MS spectra based on our derivation shown in Eq. 1. The protein sequence [MTDQEAIQDLWQWR] was chosen arbitrarily to form the theoretical spectra for our work. The theoretical spectra are subjected to different degradation processes, including the introduction of white Gaussian noise, reduction in ion peak intensities and removal of ion peaks, as described in the Methods section. The test results are tabulated in Table 1.
In the first test, we included all the theoretical b- and y-ion peaks in the spectrum, with white Gaussian noise (noise with a normal distribution) of different amplitudes added. The scores are captured in Section A of Table 1. We observed that the QS remains stable for noise amplitudes between 0 and 10% of the peak intensity.
In the second test, we added random peaks of equal amplitude to the b- and y-ions, in addition to the white Gaussian noise. The random peaks represent spurious ion peaks intended to degrade the quality of the spectrum. With 10 and 20 random peaks added, the QS is 4.6511 and 4.6442 respectively. This shows that the score is largely unaffected by random peaks as long as the b- and y-ions are intact.
Lastly, we randomly removed some of the b- or y-ion peaks to simulate the loss of certain ion fragments. As the number of ions removed increases from 2 to 8, the QS drops from 4.7692 to 2.9114 for b-ion loss and from 3.9813 to 2.2562 for y-ion loss, as shown in Section E and Section F of Table 1. As the number of ion peaks is reduced further, the mid-point peak is no longer detectable. These tests show the relation between the quality of the spectrum and the QS that we established to assess it.
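The degradation steps used in these tests can be sketched in Python as simple list transformations. This is only one plausible reading of the procedure: the helper names, the choice of noise standard deviation as a fraction of the largest peak, the clipping of negative values and the toy eight-peak spectrum are all assumptions, not details confirmed by the paper.

```python
import random


def add_noise(intensities, level, rng):
    """Add zero-mean Gaussian noise with standard deviation level * max intensity.

    Negative results are clipped to zero (an assumption).
    """
    sigma = level * max(intensities)
    return [max(0.0, x + rng.gauss(0.0, sigma)) for x in intensities]


def scale_peaks(intensities, indices, reduction):
    """Lower the peaks at the given indices by a fraction (0.30 means 30% lower)."""
    chosen = set(indices)
    return [x * (1.0 - reduction) if i in chosen else x
            for i, x in enumerate(intensities)]


def drop_peaks(intensities, indices, count, rng):
    """Set `count` randomly chosen peaks among the given indices to zero."""
    removed = set(rng.sample(list(indices), count))
    return [0.0 if i in removed else x for i, x in enumerate(intensities)]


rng = random.Random(0)       # fixed seed so the sketch is repeatable
spectrum = [1.0] * 8         # eight unit-height peaks
b_like = range(0, 4)         # pretend the first four peaks are b-ions

noisy = add_noise(spectrum, 0.05, rng)
weaker = scale_peaks(spectrum, b_like, 0.30)
fewer = drop_peaks(spectrum, b_like, 2, rng)
print(weaker, fewer)
```

Each degraded spectrum would then be scored with the QS procedure to see how the score responds.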
Qualitative measurement of experimental tandem MS spectra
The fragmentation of a peptide sequence using a conventional mass spectrometer produces spectra consisting mostly of b- and y-ion peaks. The quality of the mass spectra therefore depends mainly on the presence of the b- and y-ions in the spectra. Current state-of-the-art database search tools depend heavily on these ion peaks, and the lack of such peaks leads to no protein match or, in the worst case, to the erroneous matching of proteins in the database. Some database search algorithms allow the inclusion of a- and/or z-ions; such inclusion makes the search more complex and computationally intensive, and hence significantly slows down the protein identification process.
We proposed a novel method in which the quality of a mass spectrum is determined from its self-convolution. This approach complements existing methods in selecting good quality tandem MS spectra to be processed by database search and/or de novo sequencing. The method is unique in that it depends neither on the charge of the fragmented ion nor on its length. Random peaks, such as those produced by machine noise or contaminants (e.g. keratin), will not affect the process regardless of their intensity, as it requires a complementary pair to work.
Since the presence of a fair number of complementary b- and y-ions constitutes a good quality mass spectrum, selecting spectra with high QS values ensures that only good quality tandem MS spectra are passed on for protein identification.
We note that tandem MS spectra having non-complementary b- and y-ions might score poorly using this approach. Examples of such spectra are those having a large number of y-ions but only very few complementary b-ions, and vice versa.
We conclude that the new approach is effective and useful in assessing the quality of a tandem mass spectrum by analysing the self-convolution of the spectrum. The method relies mainly on the symmetry property inherited from the formation of complementary b- and y-ions found in tandem MS spectra. The proposed assessment scheme can be used to complement existing prefilter and assessment processes to ensure that only good quality spectra are sent for protein identification, reducing false positive protein detection by database search and de novo sequencing tools. The method can be further improved by taking other complementary ions into consideration, such as a-ions and x-ions.
We proposed a method that exploits the naturally inherited symmetry property of the tandem mass spectrum. The symmetry of the spectrum formed by the combination of b- and y-ions can be observed easily in the spectrum shown in Fig. 2. The m/z difference between b1 and b2 is equal to that between y8 and y7, as both represent the same amino acid, alanine, at 71.04 Dalton. Likewise, the m/z difference between b2 and b3 is equal to that between y7 and y6, as both represent the same amino acid, glycine, at 57.02 Dalton, and so on. This observed symmetry is a very useful feature, as it can be used to determine the quality of the spectrum generated by the mass spectrometer. If a given spectrum contains all the b-ions and y-ions of a peptide, the self-convolution of the mass spectrum produces its highest peak when all the corresponding b-ion and y-ion peaks are aligned. For example, for the spectrum shown in Fig. 2, the highest peak occurs when y7, y6, y5, y4, y3 and y2 are aligned with b2, b3, b4, b5, b6 and b7, respectively, on the m/z axis. This peak theoretically occurs at the mid-point of the self-convolution result.
To verify this observation, the molecular weights of the theoretical b- and y-ions were generated for the peptide sequence [MTDQEAIQDLWQWR] using MS-Digest.
The b-ions thus obtained are:
b = [233.10, 348.12, 476.18, 605.22, 676.26, 789.35, 917.40, 1032.43, 1145.51, 1331.59, 1459.65, 1645.73];
The y-ions generated are:
y = [1688.80, 1587.76, 1472.73, 1344.67, 1215.63, 1144.59, 1031.51, 903.45, 788.42, 675.34, 489.26, 361.20, 175.12];
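The symmetry can be checked directly on the values listed above with a short Python sketch. It assumes that the b list runs from b2 to b13 and the y list from y13 to y1, which is what the numbers suggest for this 14-residue peptide, so that each b-ion is paired with the y-ion one position further along the y list.

```python
# b- and y-ion m/z values exactly as listed above.
b = [233.10, 348.12, 476.18, 605.22, 676.26, 789.35, 917.40, 1032.43,
     1145.51, 1331.59, 1459.65, 1645.73]
y = [1688.80, 1587.76, 1472.73, 1344.67, 1215.63, 1144.59, 1031.51,
     903.45, 788.42, 675.34, 489.26, 361.20, 175.12]

# Pair b2 with y12, b3 with y11, ..., b13 with y1.
# The first y value (assumed to be y13) has no listed b partner.
pair_sums = [bi + yi for bi, yi in zip(b, y[1:])]

spread = max(pair_sums) - min(pair_sums)
mean_sum = sum(pair_sums) / len(pair_sums)
print(f"{len(pair_sums)} pairs, mean sum {mean_sum:.2f}, spread {spread:.2f}")
```

All twelve complementary pairs add up to nearly the same m/z value, so in the self-convolution they all contribute to the same position, which is why a single dominant peak is expected there.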
Df = fft(data);         % discrete Fourier transform of the spectrum intensities
DD = Df .* Df;          % element-wise product, i.e. the DFT of the self-convolution
DD(1:10) = 0;           % remove the near-DC components
IDD = abs(ifft(DD));    % amplitude of the inverse discrete Fourier transform
NIDD = IDD / max(IDD);  % normalised self-convolution values
Determine the maximum peak value at the mid-point of the normalised self-convolution values, Pmax(mid-point), within the +/- 2 Dalton error window of the MS fragment ion mass values.
Find the N highest peaks to the left (P_L) and the N highest peaks to the right (P_R) of the mid-point peak. The value of N ranges from 10 to 30, depending on the mono-isotopic precursor mass of the peptide.
Calculate the quality score as the ratio of the maximum mid-point peak to the average of the highest peaks to the left and right of the mid-point peak.
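The scoring steps above can be sketched in Python as follows. This is an illustrative reading, not the authors' MATLAB code: it replaces the FFT route with a direct sum over peak pairs (equivalent to a linear self-convolution of a sparse spectrum), omits the DC removal step, and assumes 1 Da bins. The names `self_convolve_sparse`, `quality_score` and the parameter `mid_mz` (the m/z sum at which complementary pairs align) are hypothetical.

```python
def self_convolve_sparse(peaks):
    """Self-convolution of a sparse spectrum given as {bin: intensity}."""
    out = {}
    for j, fj in peaks.items():
        for k, fk in peaks.items():
            out[j + k] = out.get(j + k, 0.0) + fj * fk
    return out


def quality_score(mz, intensity, mid_mz, n_peaks=10, window=2):
    """Mid-point peak divided by the mean of the N highest peaks on each side."""
    peaks = {}
    for m, inten in zip(mz, intensity):
        key = round(m)                       # 1 Da bins (an assumption)
        peaks[key] = peaks.get(key, 0.0) + inten
    conv = self_convolve_sparse(peaks)
    top = max(conv.values())
    conv = {k: v / top for k, v in conv.items()}   # normalise so the maximum is 1

    mid = round(mid_mz)
    # Largest value within +/- window bins of the expected mid-point.
    p_mid = max(conv.get(mid + d, 0.0) for d in range(-window, window + 1))
    left = sorted((v for k, v in conv.items() if k < mid - window), reverse=True)
    right = sorted((v for k, v in conv.items() if k > mid + window), reverse=True)
    others = left[:n_peaks] + right[:n_peaks]
    if not others:
        return float("inf")                  # nothing to compare against
    return p_mid / (sum(others) / len(others))


# Toy spectrum: three complementary pairs whose m/z values each sum to 100.
toy_mz = [10, 25, 40, 60, 75, 90]
toy_intensity = [1.0] * 6
qs_good = quality_score(toy_mz, toy_intensity, mid_mz=100, n_peaks=3)

# Same spectrum with the three high-mass partners missing.
qs_poor = quality_score(toy_mz[:3], toy_intensity[:3], mid_mz=100, n_peaks=3)
print(qs_good, qs_poor)
```

With all pairs present the mid-point peak dominates its neighbours, whereas with one side of every pair missing no peak forms at the mid-point and the score falls to zero, mirroring the behaviour described for spectra lacking complementary ions.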
Availability and requirements
Project name: MS Quality Assessment
Operating system(s): UNIX or Windows
Programming language: MATLAB version 5.3, no special toolbox needed.
Licence: Email request to author.
Any restrictions to use by non-academics: Licence needed.
We would like to thank Prof Kon Oi Lian, National Cancer Centre of Singapore, for providing us with the experimental tandem mass spectra that made this work possible, and Nanyang Polytechnic for providing financial and equipment support.
- What is mass spectrometry?. [http://www.asms.org/whatisms]
- Herbert CG, Johnstone RAW: Mass spectrometry basics. 2003, CRC Press LLC, Boca Raton, FL
- Puymbrouck JV, Angulo D, Drew K, Hollenbeck LA, Battre D, Schilling A, Jabon D, Laszewski GV: A batch import module for an empirically derived mass spectral database. DePaul CTI Technical report. 2006
- Desiere F, Deutsch EW, King NL, Nesvizhskii AI, Mallick P, Eng J, Chen S, Eddes J, Loevenich SN, Aebersold R: The peptideatlas project. Nucleic Acids Res. 2006, 34: D655-
- Kinter M, Sherman NE: Protein sequencing and identification using mass spectrometry. Wiley-Interscience. 2000, New York
- Ma Bin, Zhang Kaizhong, Hendrie Christopher, Liang Chengzhi, Li Ming, Doherty-Kirby Amanda, Lajoie Gilles: PEAKS: Powerful software for peptide de novo sequencing by ms/ms. Rapid Communications in Mass Spectrometry. 2003, 17 (20): 2337-2342.
- Perkins DN, Pappin DJ, Creasy DM, Cottrell JS: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis I. 1994, 20: 3551-3567.
- MASCOT by Matrixscience. [http://www.matrixscience.com/home.html]
- Phenyx by Genebio. [http://www.phenyx-ms.com/]
- Eng Jimmy, McCormack Ashley, Yates John: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom. 1999, 5: 976-989.
- Geer LY, Markey SP, Kowalak JA, Wagner L, Xu M, Maynard DM, Yang X, Shi W, Bryant SH: Open mass spectrometry search algorithm. J Proteome Res. 2004, 3: 958-964.
- Johnson RS, Taylor JA: Searching sequence databases via de novo peptide sequencing by tandem mass spectrometry. Mol Biotechnol. 2002, 146: 41-61.
- Shevchenko A, Sunyaev S, Loboda A, Shevchenko A, Bork P, Ens W, Standing KG: Charting the proteomes of organisms with unsequenced genomes by MALDI-quadrupole time-of-flight mass spectrometry and BLAST homology searching. Anal Chem. 2001, 73 (9): 1917-1926.
- Mann M, Wilm M: Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal Chem. 1994, 66 (24): 4390-4399.
- Sunyaev S, Liska AJ, Golod A, Shevchenko A, Shevchenko A: Multitag: Multiple error-tolerant sequence tag search for the sequence-similarity identification of proteins by mass spectrometry. Anal Chem. 2003, 75 (6): 1307-1315.
- Tabb DL, Saraf S, Yates JR: Gutentag: High-throughput sequence tagging via an empirically derived fragmentation model. Anal Chem. 2003, 75 (23): 6415-6421.
- Eng JK, McCormack AL, Yates JR: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom. 1994, 5: 976-989.
- Pevzner PA, Dancik V, Tang CL: Mutation-tolerant protein identification by mass spectrometry. J Comput Biol. 2000, 7: 777-787.
- Field HI, Fenyo D, Beavis RC: A bioinformatics solution that automates proteome mass spectral analysis, optimises protein identification, and archives data in a relational database. Proteomics. 2002, 2: 36-47.
- Clauser KR, Baker P, Burlingame AL: Role of accurate mass measurement (+/- 10 ppm) in protein identification strategies employing ms or ms/ms and database searching. Anal Chem. 1999, 71: 2871-2882.
- Fenyo D, Qin J, Chait BT: Protein identification using mass spectrometric information. Electrophoresis. 1998, 19: 998-1005.
- Zhang N, Aebersold R, Schwikowski B: ProbID: A probabilistic algorithm to identify peptides through sequence database searching using tandem mass spectral data. Proteomics. 2002, 2: 1406-
- Cargile BJ, Bundy JL, Stephenson JL: Potential for false positive identifications from large databases through tandem mass spectrometry. J Proteome Res. 2004, 3: 1082-1085.
- Keller Andrew, Purvine Samuel, Nesvizhskii Alexey, Stolyar Sergey, Goodlett David, Kolker Eugene: Experimental protein mixture for validating tandem mass spectral analysis. OMICS: A Journal of Integrative Biology. 2002, 6: 207-212.
- Jussi Salmi, Robert Moulder, Jan-Jonas Filen, Olli Nevalainen S, Tuula Nyman A, Riitta Lahesmaa, Tero Aittokallio: Quality classification of tandem mass spectrometry. Bioinformatics Journal. 2006, 22 (4): 400-406.
- Fang-Xiang Wu, Pierre Gagné, Arnaud Droit, Guy Poirier G: Quality assessment of peptide tandem mass spectra. First International Multi-Symposiums on Computer and Computational Sciences. 2006, 1: 243-250.
- Samuel Purvine, Natali Kolker, Eugene Kolker: Spectral quality assessment for high-throughput tandem mass spectrometry proteomics. OMICS A Journal of Integrative Biology. 2004, 8 (3): 255-256.
- Bern Marshall, Goldberg David, Hayes McDonald W, Yates John: Automatic quality assessment of peptide tandem mass spectra. Bioinformatics Journal. 2004, 20 (Suppl 1): i49-i54.
- Yik-Chung Wu, Tung-Sang Ng: Symbol timing recovery for GMSK modulation based on square algorithm. IEEE Comm Lett. 2001, 5 (5): 221-223.
- Bharath AA: A tiling of phase-space through self convolution. IEEE Transactions on Signal Processing. 2000, 48: 3581-3585.
- Roepstorff P, Fohlman J: Proposal for a common nomenclature for sequence ions in mass spectra of peptides. Biomed Mass Spectrom. 1984, 11 (11): 601-
- Johnson Richard, Martin Stephen, Biemann Klaus, Stults John, Throck Watson J: Novel fragmentation process of peptides by collision-induced decomposition in a tandem mass spectrometer: Differentiation of leucine and isoleucine. Anal Chem. 1987, 59 (21): 2621-2625.
- Biemann K: Contributions of mass spectrometry to peptide and protein structure. Biomed Environ Mass Spectrom. 1988, 16 (1–12): 99-111.
- Biemann K: Mass spectrometry. Methods in Enzymology. Edited by: McCloskey JA. 1990, San Diego: Academic Press, 193: 886-887.
- McCormack AL, Jones JL, Wysocki VH: Surface-induced dissociation of multiply-protonated peptides. J Am Soc Mass Spectrom. 1992, 3: 859-862.
- Barbacci DC, Russell DH: Sequence and side-chain specific photofragment (193 nm) ions from protonated substance-p by matrix-assisted laser desorption ionization time-of-flight mass spectrometry. J Am Soc Mass Spectrom. 1999, 10: 1038-1040.
- Zubarev RA, Kelleher NL, McLafferty FW: Electron capture dissociation of multiply charged protein cations. A nonergodic process. J Am Chem Soc. 1998, 120 (13): 3265-3266.
- Syka JE, Coon JJ, Schroeder MJ, Shabanowitz J, Hunt DF: Peptide and protein sequence analysis by electron transfer dissociation mass spectrometry. Proc Natl Acad Sci USA. 2004, 101: 9528-9533.
- McCormack AL, Somogyi A, Dongre AR, Wysocki VH: Surface-induced dissociation in conjunction with a quantum mechanical approach. Anal Chem. 1993, 65: 2859-2872.
- Wysocki VH, Tsaprailis G, Smith LL, Breci LA: Mobile and localized protons: a framework for understanding peptide dissociation. J Mass Spectrom. 2000, 35: 1399-1406.
- Zhang Z: Prediction of low-energy collision-induced dissociation spectra of peptides. Anal Chem. 2004, 76: 3908-3922.
- Elias JE, Gibbons FD, King OD, Roth FP, Gygi SP: Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nat Biotechnol. 2004, 22 (2): 214-219.
- Arnold Randy, Jayasankar Narmada, Aggarwal Divya, Tang Haixu, Radivojac Predrag: A machine learning approach to predicting peptide fragmentation spectra. Pacific Symposium on Biocomputing. 2006, 11: 219-230.
- Yu C, Lin Y, Sun S, Cai J, Zhang J, Bu D, Zhang Z, Chen R: An iterative algorithm to quantify factors influencing peptide fragmentation during tandem mass spectrometry. J Bioinform Comput Biol. 2007, 5 (2): 297-311.
- MS-Digest. [http://prospector.ucsf.edu/prospector/4.27.1/cgibin/msForm.cgi?form=msdigest]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.