Tandem mass spectrometry data quality assessment by self-convolution

Background Many algorithms have been developed for deciphering tandem mass spectrometry (MS) data sets. They fall essentially into two classes: the first searches a database of theoretical mass spectra, while the second is based on de novo sequencing from raw mass spectrometry data. The quality of the mass spectra significantly affects protein identification in both cases. This prompted the authors to explore ways to measure the quality of MS data sets before subjecting them to protein identification algorithms, allowing more meaningful searches and greater confidence in the proteins identified.
Results The proposed method measures the quality of MS data sets based on the symmetric property of the b- and y-ion peaks present in an MS spectrum. Self-convolution of the MS data with its time-reversed copy was employed. Owing to the symmetric nature of the b-ion and y-ion peaks, the self-convolution of a good spectrum produces its highest-intensity peak at the mid-point. To reduce processing time, self-convolution was performed using the Fast Fourier Transform and its inverse, followed by removal of the "DC" (Direct Current) component and normalisation of the data set. The quality score was defined as the ratio of the intensity at the mid-point to that of the remaining peaks of the convolution result. The method was validated using both theoretical mass spectra, with various permutations, and several real MS data sets. The results were encouraging, revealing a high percentage of positive prediction rates for spectra with good quality scores.
Conclusion We have demonstrated a method for determining the quality of tandem MS data sets. By determining the quality of tandem MS data before subjecting them to protein identification algorithms, spurious protein predictions due to poor tandem MS data are avoided, giving scientists greater confidence in the predicted results. We conclude that the algorithm performs well and could potentially be used as a pre-processing step for all mass-spectrometry-based protein identification tools.

A mass spectrometer is an instrument used to measure the mass-to-charge ratio of individual molecules that have been converted into electrically charged molecules, or ions [1]. These ions are filtered and ordered from lower to higher mass-to-charge ratio (m/z) before passing through an ion detector in the instrument [2]. In proteomic analysis, matrix-assisted laser desorption ionisation (MALDI) and electrospray ionisation (ESI) are the two ionisation techniques generally used. Mass spectrometry is seeing rapidly growing use in biomarker discovery and clinical proteomics, where hundreds of proteins can be sequenced quickly. As a consequence, large amounts of proteomics data are produced and made available to the public [3][4][5].
Although the generation of raw MS spectra has become easier, the analysis and identification of the data still pose many challenges. Many protein identification tools have been developed, such as PEAKS [6], MASCOT [7,8], Phenyx [9], SEQUEST [10] and OMSSA [11]. High-throughput proteomics involves the analysis of hundreds of thousands of peptide spectra derived from biological samples. Four general types of algorithm can identify these spectra: 1. De novo calling of the sequence directly from the spectrum [6,12,13].
3. Cross-correlation methods that correlate experimental spectra with theoretical spectra [17,18]. 4. Probability-based matching that calculates a score based on the statistical significance of a match between an observed peptide fragment and those calculated from a sequence search library [7][19][20][21][22].
Cross-correlation methods and probability-based matching are two well-received methods for protein identification. In these methods, a theoretical mass spectra database is first generated from known protein sequences. To search this database with experimental spectra, the correlation of the experimental and theoretical spectra is calculated. Based on the statistical properties of the protein database and the correlation values (actual implementation is more complex), a score is given for the matched spectra.
Most of these tools have attained a certain degree of success thus far; nevertheless, reliable protein identification using these methods is still a time-consuming and program-dependent task. A considerable frequency of false positive protein identifications has been reported in independent studies [23,24]. Because the quality of mass spectra is crucial in protein identification, several attempts have been made to address the issue using information obtained from the mass spectra generated by fragmented peptides [25][26][27][28]. In particular, Purvine et al. [27] used a prefilter with three features for tandem MS spectra classification: the first addressed the uncertainty in charge state assignments, the second was based on total signal intensity and the third on a signal-to-noise estimate. They obtained good results by adjusting these features. Although these approaches have been useful, we introduce an additional prefilter feature, based on the symmetry property of the b- and y-ions, to complement and improve the pre-filter process.

Convolution
Convolution is a mathematical operation commonly used in digital signal processing (DSP). For discrete time series, the convolution is given as:

$$h_i = \sum_{j=0}^{m} f_j \, g_{i-j}$$

where $f_j$ and $g_j$ are two time series data sets. Self-convolution refers to convolution applied to a single data series, so that $g_{i-j}$ is the time-reversed copy of the data series $f_j$.
Self-convolution has been used in many applications where symmetry is a key feature of the signal, such as those found in digital communication [29] and image processing [30]. We will show in this work that tandem mass spectra have such a property, inherited naturally from the fragmentation process, and hence that the same approach can be used to extract information from the spectra. The success of this method depends on the availability of complementary b- and y-ions, which are the two most commonly found ion types in conventional tandem mass spectrometry.
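To make the definition concrete, the following minimal Python sketch evaluates the convolution sum directly for a short toy series. The series values and the function name `self_convolve` are illustrative assumptions, not data or code from this work. When the peaks are mirrored about the centre of the series, every mirrored pair contributes to the same output index, so the largest value appears at the mid-point of the result.

```python
def self_convolve(f):
    """Direct evaluation of h[i] = sum_j f[j] * f[i - j] for a single series f."""
    n = len(f)
    h = [0.0] * (2 * n - 1)
    for j, fj in enumerate(f):
        for k, fk in enumerate(f):
            h[j + k] += fj * fk   # the term f[j] * f[i - j] with i = j + k
    return h

# Toy series of length 11 (hypothetical values): peaks mirrored about index 5.
symmetric = [0.0] * 11
for p in (2, 3, 7, 8):
    symmetric[p] = 1.0

# Same number of peaks, but without the mirror symmetry.
asymmetric = [0.0] * 11
for p in (2, 3, 6, 8):
    asymmetric[p] = 1.0

h_sym = self_convolve(symmetric)
h_asym = self_convolve(asymmetric)
mid = len(h_sym) // 2   # mid-point of the 21-point result
print(h_sym.index(max(h_sym)), h_sym[mid], h_asym[mid])
```

The direct double loop costs O(n^2) operations, which is acceptable for a toy series but motivates a transform-based evaluation for full-resolution spectra.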

Peptide fragmentation
Peptide fragmentation is a process in which peptide fragment ions are generated by dissociation in the ion trap of a mass spectrometer. In this process, breakage can occur at any bond in the peptide, but commonly occurs at the peptide bond. When a peptide is fragmented at a single peptide bond between the carbonyl and the nitrogen, two fragments are formed. A fragment that retains the positive charge at the C-terminus of the peptide ion is called a y-ion; a fragment that retains the positive charge at the N-terminus is known as a b-ion. When a singly charged peptide is fragmented, the charge is retained at only one terminus, and only the fragment containing the charge is detected, while the other fragment is lost as a neutral fragment. Doubly charged peptides tend to produce two singly charged ions, though sometimes doubly charged ions can also be formed.
The types of fragment ions observed in a tandem MS spectrum depend on many factors, including the primary peptide sequence, the amount of internal energy and how the energy was introduced, the charge state, etc. The accepted nomenclature for fragment ions was first proposed by Roepstorff and Fohlman [31], and subsequently modified by Johnson et al. [32] and Biemann [33,34]. Different dissociation methods are available, including the commonly used gas-phase collision-induced dissociation (CID) [33], surface-induced dissociation [35], photodissociation [36], electron-capture dissociation [37], and electron transfer dissociation [38]. The b-ions and y-ions are usually formed when fragmentation occurs under low-energy conditions. Fig. 1 shows all possible breakage points along the peptide backbone.
Other ions are also formed, such as a-ions and x-ions, which form one complementary pair, and c-ions and z-ions, which form another. The a-ions and x-ions are formed when the peptide fragments between the amino acid side chain and the carbonyl group. The c-ions and z-ions are formed when the peptide fragments between the nitrogen and the amino acid side chain. These ions are formed when fragmentation occurs under high-energy conditions, since more energy is required to break these bonds. Fig. 2 shows a typical tandem MS spectrum.
The development of the chemical theory of peptide fragmentation [39,40] has enabled the de novo prediction of fragmentation spectra from peptide sequences. Using a kinetic model, Zhang made the first successful attempt at predicting the low-energy CID spectra of singly and doubly charged peptides [41]. Elias et al. [42] were the first to successfully utilize a set of well-annotated fragmentation spectra acquired from an electrospray ion-trap mass spectrometer in an attempt to infer the probabilistic rules of fragmentation. More recently, Randy et al. used a machine-learning algorithm to predict various fragment-ion types of doubly and triply charged precursor ions by learning peptide fragmentation rules in the form of posterior probabilities [43]. Yu et al. proposed a novel method to automatically learn the factors influencing fragmentation from a training set of tandem MS spectra [44]. Despite the availability of these prediction models, it is unclear how they could be used to predict fragment ions across different types of mass spectrometer.

Results
To validate the proposed method of tandem MS spectra assessment, we conducted a series of tests on theoretical as well as experimental MS spectra. The results of the tests on theoretical MS spectra are tabulated in Table 1. We then used another 60 sets of experimental tandem MS spectra to test its effectiveness and robustness.

Quantitative measurement of theoretical tandem MS spectra
We first computed the quality score (QS) on theoretical MS spectra based on the derivation shown in Eq. 1. The protein sequence [MTDQEAIQDLWQWR] was chosen arbitrarily to form the theoretical spectra for our work. The theoretical spectra were subjected to different degradation processes, including the introduction of white Gaussian noise, reduction of ion peak intensities and removal of ion peaks, as described in the Methods section. The test results are tabulated in Table 1. In the first test, we included all the theoretical b- and y-ion peaks in the spectrum and added white Gaussian noise (noise with a normal distribution) of different amplitudes. The scores are given in Section A of Table 1. We observed that the QS remains stable for noise amplitudes between 0 and 10% of the peak intensity.
In the second test, in addition to the white Gaussian noise, we added random peaks with the same amplitude as the b- and y-ions. These random peaks represent spurious ion peaks intended to degrade the quality of the spectrum. With 10 and 20 random peaks added, the QS is 4.6511 and 4.6442 respectively, which shows that the score is not much affected by random peaks as long as the b- and y-ions are intact.
In the next two test scenarios, we reduced the intensity of the b- and y-ions to simulate a lack of fragmented b- and y-ions in the spectrum. As the b-ions reduce in intensity, the QS drops from 4.5330 to 2.2654 at 10% to 70% reduction of the b-ion intensity, as shown in Section C of Table 1. Reducing the y-ion intensity has a similar effect on the QS: it drops from 4.6106 to 0.5468 at 10% to 70% reduction in intensity, as shown in Section D of Table 1. The results are shown in Fig. 3. As the intensity is reduced further, a peak is no longer detected in the mid-point window of the self-convolution result.

Figure 1. Peptide fragmentation. This figure shows the various breakage points along a peptide and the complementary N-terminal and C-terminal ions that are formed.
Lastly, we randomly removed some of the b- or y-ion peaks to simulate the loss of certain ion fragments. The number of ions removed varies from 2 to 8, and we observed that the QS drops from 4.7692 to 2.9114 for b-ion loss and from 3.9813 to 2.2562 for y-ion loss, as shown in Section E and Section F of Table 1. As the number of ion peaks is reduced further, the mid-point peak is no longer detectable. These tests show the relation between the quality of the spectrum and the QS that we established to assess it.
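The four degradation processes used in these tests can be sketched in Python as below. This is an illustrative sketch only: the function names, the toy 50-bin spectrum, the ion positions and the reading of "noise amplitude" as the standard deviation of the Gaussian noise are all assumptions, not the settings used to produce Table 1.

```python
import random

def add_noise(spectrum, amplitude, rng):
    """Add zero-mean Gaussian noise; 'amplitude' is treated as the standard deviation (assumption)."""
    return [v + rng.gauss(0.0, amplitude) for v in spectrum]

def add_random_peaks(spectrum, count, intensity, rng):
    """Insert 'count' spurious peaks of the given intensity at randomly chosen empty bins."""
    out = list(spectrum)
    empty = [i for i, v in enumerate(out) if v == 0.0]
    for i in rng.sample(empty, count):
        out[i] = intensity
    return out

def reduce_intensity(spectrum, ion_indices, reduction):
    """Scale the selected ion peaks down by the fraction 'reduction' (0.7 means a 70% reduction)."""
    out = list(spectrum)
    for i in ion_indices:
        out[i] *= (1.0 - reduction)
    return out

def remove_peaks(spectrum, ion_indices, count, rng):
    """Set 'count' randomly chosen ion peaks to zero to simulate lost fragments."""
    out = list(spectrum)
    for i in rng.sample(list(ion_indices), count):
        out[i] = 0.0
    return out

# Hypothetical toy spectrum: three b-ion bins and their mirrored y-ion bins.
rng = random.Random(0)
spectrum = [0.0] * 50
b_ions, y_ions = [5, 12, 20], [29, 37, 44]
for i in b_ions + y_ions:
    spectrum[i] = 100.0

noisy = add_noise(spectrum, 10.0, rng)
spurious = add_random_peaks(spectrum, 10, 100.0, rng)
weak_b = reduce_intensity(spectrum, b_ions, 0.70)
fewer_y = remove_peaks(spectrum, y_ions, 2, rng)
```

Each function returns a new list and leaves its input untouched, so the same clean spectrum can be reused for every test scenario.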

Qualitative measurement of experimental tandem MS spectra
We started the quality assessment by simply performing a self-convolution on some of the experimental MS spectra. Fig. 4 shows a plot of the self-convolution of one of the raw tandem MS spectra. Although the plot does show a high peak in the mid-point window of the result, we found that the product of two high-intensity peaks happened to fall at the mid-point by coincidence. It would therefore be erroneous to take this result as an indication of a good-quality spectrum. We thus improved the approach further by taking side peaks into account and adding a normalisation step.
The proposed method was subsequently tested on 60 sets of real tandem MS spectra (unpublished). They were subjected to the QS scoring function described in Eq. 1. We considered the 15 highest-intensity peaks to the left and to the right of the mid-point window of each spectrum. The self-convolution result is shown in Fig. 5. The DC-shifted self-convolution plot of the original tandem MS spectrum is contrasted with that of the newly generated spectrum in Fig. 6. We have assumed that 30 peaks are sufficient for the calculation, but this number can be increased where more ion fragments are expected. All tandem mass spectra with high scores were identified successfully using MASCOT [8] with high confidence (> 45).

Discussion
The fragmentation of a peptide sequence using a conventional mass spectrometer produces spectra that consist mostly of b- and y-ion peaks. The quality of the mass spectra therefore depends mainly on the presence of the b- and y-ions in the spectra. Current state-of-the-art database search tools depend heavily on these ion peaks, and the lack of such peaks would lead to no protein match or, in the worst case, to the erroneous matching of proteins in the database. Some database search algorithms allow the inclusion of a- and/or z-ions; such inclusion makes the search more complex and computationally intensive, and hence significantly slows down the protein identification process.

Figure 2. Tandem mass spectrum. This figure shows the possible fragmentation of the short peptide AVAGCAGAR and the corresponding intensity versus m/z plot.

In our work, we tested the qualitative measurement of the tandem mass spectra based on different noise intensities (Sec. A), additional spurious peaks (Sec. B), different b-ion intensities (Sec. C), different y-ion intensities (Sec. D), different percentage loss of b-ions (Sec. E), and different percentage loss of y-ions (Sec. F). We observed the drop in score as the quality of the theoretical mass spectrum deteriorates.
We propose a novel method in which the quality of a mass spectrum is determined from its self-convolution. This approach complements existing methods for selecting good-quality tandem MS spectra to be processed by database search and/or de novo sequencing. The method is unique in that it depends on neither the charge of the fragmented ion nor its length. Random peaks, such as those produced by machine noise or contaminants (e.g. keratin), do not affect the process regardless of their intensity, as the method requires a complementary pair to work.
Because the presence of a fair number of complementary b- and y-ions is what constitutes a good-quality mass spectrum, selecting spectra with high QS values ensures that only good-quality tandem MS spectra pass the pre-filter and are processed for protein identification.
We note that tandem MS spectra with non-complementary b- and y-ions might score poorly using this approach. Examples are spectra with a large number of y-ions but very few complementary b-ions, and vice versa.

Conclusion
We conclude that the new approach is effective and useful in assessing the quality of a tandem mass spectrum by analysing its self-convolution. The method relies mainly on the symmetry property inherited from the formation of the complementary b- and y-ions found in tandem MS spectra. The proposed assessment scheme can be used to complement existing pre-filter and assessment processes to ensure that only good-quality spectra are sent for protein identification, reducing false positive protein detection by database search and de novo sequencing tools. The method can be improved further by taking other complementary ions, such as a-ions and x-ions, into consideration.
Figure 3. Plot of QS versus ion intensity reduction. This figure shows the effect of a reduction in ion intensity on the QS.

Figure 4. Plot of self-convolution of experimental mass spectrum. This figure shows the actual mass spectrum (left) and its self-convolution result (right). A high mid-point intensity might not indicate a good-quality spectrum, as the product of two high-intensity peaks could generate it by chance.

Methods
We propose a method that exploits the naturally inherited symmetry property of the tandem mass spectrum. The symmetry of the spectrum formed by the combination of b- and y-ions can be observed easily in the spectrum shown in Fig. 2. The m/z difference between $b_1$ and $b_2$ is equivalent to that between $y_8$ and $y_7$, as they represent the same amino acid, alanine, at 71.04 Dalton.
Figure 5. Pre-processing of ion peak intensities. This figure shows a plot of the experimental tandem MS spectrum (left) and the newly generated mass spectrum after pre-processing (right).

Figure 6. DC-shifted self-convolution plot of experimental tandem MS. This figure shows the difference between the DC-shifted self-convolution results obtained from the original mass spectrum (left) and the pre-processed mass spectrum (right).
Likewise, the m/z difference between $b_2$ and $b_3$ is equivalent to that between $y_7$ and $y_6$, as they represent the same amino acid, glycine, at 57.02 Dalton, and so on. This observed symmetry is a very useful feature, as it can be used to determine the quality of the spectrum generated by the mass spectrometer. If a given spectrum contains all the b-ions and y-ions of a peptide, the self-convolution of the mass spectrum produces its highest peak when all the corresponding b-ion and y-ion peaks are aligned. For example, for the spectrum shown in Fig. 2, the highest peak occurs when $y_7, y_6, y_5, y_4, y_3, y_2$ are aligned with $b_2, b_3, b_4, b_5, b_6, b_7$ respectively on the m/z axis. This peak occurs theoretically at the mid-point of the self-convolution result.
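The symmetry can be checked numerically. The sketch below computes singly charged b- and y-ion m/z values for the peptide of Fig. 2 from standard monoisotopic residue masses and verifies that each complementary pair sums to the same value and that the ladder steps mirror each other. It assumes unmodified residues (including unmodified cysteine) and singly protonated fragments; the function name `fragment_mz` is hypothetical.

```python
# Standard monoisotopic residue masses (Da) for the residues in AVAGCAGAR.
MONO = {"A": 71.03711, "V": 99.06841, "G": 57.02146, "C": 103.00919, "R": 156.10111}
WATER = 18.01056    # H2O, carried by every y-ion
PROTON = 1.00728    # charge carrier for singly charged ions

def fragment_mz(peptide):
    """Return dictionaries of singly charged b- and y-ion m/z values, keyed by ion number."""
    residues = [MONO[aa] for aa in peptide]
    n = len(residues)
    b = {i: sum(residues[:i]) + PROTON for i in range(1, n)}
    y = {i: sum(residues[n - i:]) + WATER + PROTON for i in range(1, n)}
    return b, y

peptide = "AVAGCAGAR"
n = len(peptide)
b, y = fragment_mz(peptide)

# Every complementary pair b_i + y_(n-i) adds up to the same total.
pair_sums = [b[i] + y[n - i] for i in range(1, n)]

# Each step up the b-ladder equals the matching step down the y-ladder.
steps_match = all(
    abs((b[i + 1] - b[i]) - (y[n - i] - y[n - i - 1])) < 1e-6
    for i in range(1, n - 1)
)
print(round(pair_sums[0], 2), steps_match)
```

Because every complementary pair sums to the same m/z value, all pairs contribute to a single index of the self-convolution, which is why the aligned peak stands out.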
To verify the observation, the molecular weights of the theoretical b- and y-ions were generated for the peptide sequence [MTDQEAIQDLWQWR] using MS-Digest [45]. A time series is then created such that the starting mass is 0 Dalton and the ending mass is 1819.84 Dalton, which is the mono-isotopic peptide precursor mass (MH+), with an interval of 0.01 Da. The following conditions are used to set the intensity of the time series data:

A plot of these b-ions and y-ions and of the self-convolution values is shown in Fig. 7. In this figure, a high peak occurs at the mid-point of the self-convolution, where the b-ions ($b_n, b_{n-1}, b_{n-2}, \ldots, b_2$) align with the corresponding y-ions ($y_2, y_3, y_4, \ldots, y_n$). However, the cumulative sum of the products of all the points also increases steadily from 0 to the mid-point and decreases thereafter, forming a triangle below the peaks. This is potentially damaging to the detection of the peaks, especially when significant noise levels are present and are compounded by low-intensity or missing b-ion and/or y-ion peaks, as we will demonstrate later. To determine the effects of increasing noise levels, we change the noise level to 10 as shown below.
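As an illustration of how such a time series might be built, the following Python sketch places the b- and y-ion peaks of the peptide on a 0.01 Da grid running from 0 Da to the precursor mass. The intensity conditions are not reproduced above, so the sketch assumes an ion intensity of 100 and a background of 0. The fragment masses are computed from standard monoisotopic residue masses rather than taken from MS-Digest, and the function name is hypothetical.

```python
# Standard monoisotopic residue masses (Da) for the residues in MTDQEAIQDLWQWR.
MONO = {
    "M": 131.04049, "T": 101.04768, "D": 115.02694, "Q": 128.05858,
    "E": 129.04259, "A": 71.03711, "I": 113.08406, "L": 113.08406,
    "W": 186.07931, "R": 156.10111,
}
WATER, PROTON = 18.01056, 1.00728
STEP = 0.01   # grid interval in Da, as stated in the text

def theoretical_spectrum(peptide, ion_intensity=100.0):
    """Return (MH+, intensity list) on a STEP-spaced grid from 0 Da to MH+.

    ion_intensity = 100 with a zero background is an assumption of this sketch.
    """
    residues = [MONO[aa] for aa in peptide]
    mh = sum(residues) + WATER + PROTON          # singly protonated precursor mass
    spectrum = [0.0] * (int(round(mh / STEP)) + 1)
    for i in range(1, len(residues)):
        b_mz = sum(residues[:i]) + PROTON            # b_i
        y_mz = sum(residues[i:]) + WATER + PROTON    # complementary y-ion
        for mz in (b_mz, y_mz):
            spectrum[int(round(mz / STEP))] = ion_intensity
    return mh, spectrum

mh, spectrum = theoretical_spectrum("MTDQEAIQDLWQWR")
n_peaks = sum(1 for v in spectrum if v > 0)
print(round(mh, 2), len(spectrum), n_peaks)
```

At this resolution the series has roughly 182,000 points, so a direct O(n^2) self-convolution is impractical and a transform-based evaluation is the natural choice.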
We observe that, although the noise level is only 10% of the ion intensity, as shown in Fig. 8, the mid-point peak is significantly reduced in comparison with the increased overall overlapping convolution values. The other peaks observable in Fig. 7 are also lost among the greatly increased overlapping convolution values caused by the higher noise level. This problem can be resolved by applying the convolution theorem and removing the DC component of the product of the Fourier transforms before performing the inverse Fourier transform. According to the convolution theorem, convolution is achieved by first applying the Discrete Fourier Transform (DFT) to the data sets, multiplying the two transforms, and then performing the inverse DFT. The key point is that the near-DC components are removed by setting the first 10 points of the DFT product to 0. Finally, the data is normalised against its largest magnitude. The pseudo-code is shown below:

As depicted in Fig. 9, the detrimental effects of noise have been eliminated: the maximum peak at the mid-point and the other observable peaks are preserved, in contrast to Fig. 8. The removal of the near-DC component and the additional normalisation step have improved our ability to determine the quality of the spectrum.
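The procedure described above (DFT, product of the transforms, removal of the near-DC points, inverse DFT, normalisation) can be sketched in Python with NumPy as follows. This is a minimal sketch under stated assumptions rather than the original pseudo-code: it uses the real-input transforms `numpy.fft.rfft` and `numpy.fft.irfft`, zero-pads to the full linear-convolution length, and is demonstrated on a hypothetical noise-free spectrum built from nominal (integer) residue masses on a 1 Da grid. The function name and the parameter `n_dc` are illustrative.

```python
import numpy as np

def dc_shifted_self_convolution(spectrum, n_dc=10):
    """Self-convolve via the convolution theorem, drop near-DC terms, normalise."""
    f = np.asarray(spectrum, dtype=float)
    n = 2 * f.size - 1               # length of the full linear convolution
    F = np.fft.rfft(f, n)            # zero-padded DFT of the spectrum
    product = F * F                  # product of the two (identical) transforms
    product[:n_dc] = 0.0             # set the first n_dc points to 0 (near-DC removal)
    h = np.fft.irfft(product, n)     # inverse DFT back to the mass axis
    return h / np.max(np.abs(h))     # normalise against the largest magnitude

# Hypothetical toy input: nominal residue masses of A V A G C A G A R on a 1 Da grid.
residues = [71, 99, 71, 57, 103, 71, 57, 71, 156]
mh = sum(residues) + 18 + 1                       # nominal precursor mass (MH+)
spectrum = np.zeros(mh + 1)
for i in range(1, len(residues)):
    spectrum[sum(residues[:i]) + 1] = 100.0       # b-ion (prefix + proton)
    spectrum[sum(residues[i:]) + 19] = 100.0      # y-ion (suffix + water + proton)

result = dc_shifted_self_convolution(spectrum)
peak = int(np.argmax(result))
mid = (result.size - 1) // 2
print(peak, mid)
```

Each complementary pair sums to the precursor mass plus one proton, so the dominant peak falls about 1 Da above the exact mid-point index, which is one reason a tolerance window around the mid-point is useful when locating it.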

Quantitative measurement
We further propose a quantitative method to determine the quality of a given tandem MS spectrum from the self-convolution values, as follows:
1) Determine the maximum peak value at the mid-point of the normalised self-convolution values ($P_{max(midpoint)}$), within the +/-2 Dalton error window of the MS fragment ion mass values.
2) Find the N highest peaks to the left ($P_L$) and the N highest peaks to the right ($P_R$) of the mid-point peak. The value of N ranges from 10 to 30, depending on the mono-isotopic peptide precursor mass of the fragment.
3) Calculate the ratio of the maximum mid-point peak to the average of the highest peaks to the left and right of the mid-point peak.
Self-convolution plot for noise amplitude = 10

We term this ratio the Quality Score (QS) of the tandem MS spectrum, as shown in the following equation:

$$QS = \frac{P_{max(midpoint)}}{\frac{1}{2N}\left(\sum_{k=1}^{N} P_{L,k} + \sum_{k=1}^{N} P_{R,k}\right)} \qquad (1)$$

Fig. 10 shows the actual components considered in the quantitative method described above. Fig. 11 shows the normalised self-convolution plot of a good tandem mass spectrum. We can see clearly that the score is higher (QS = 3.0833) in this case as compared to those shown in Fig

Figure 10. Qualitative measurement of spectrum quality.
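The three steps of the quantitative method can be sketched in Python as below. Several details are assumptions of this sketch: the "N highest peaks" on each side are approximated by the N largest values outside the mid-point window rather than by detected local maxima, the window is expressed in grid bins (`half_window=200` corresponds to +/-2 Da on a 0.01 Da grid), and the function and parameter names are hypothetical.

```python
def quality_score(h, n_side=15, half_window=200):
    """Ratio of the mid-point peak to the mean of the N largest values on each side.

    h           : normalised self-convolution values (sequence of floats)
    n_side      : N, the number of side values taken to the left and to the right
    half_window : half-width of the mid-point window, in bins
    """
    mid = (len(h) - 1) // 2
    lo = max(mid - half_window, 0)
    hi = min(mid + half_window, len(h) - 1)
    p_mid = max(h[lo:hi + 1])                            # step 1: mid-point peak
    left = sorted(h[:lo], reverse=True)[:n_side]         # step 2: N largest to the left
    right = sorted(h[hi + 1:], reverse=True)[:n_side]    #         N largest to the right
    side = left + right
    return p_mid / (sum(side) / len(side))               # step 3: ratio to their average

# Hypothetical toy input: one mid-point peak and two smaller values on each side.
h = [0.0] * 41
h[20] = 1.0
h[5], h[10] = 0.2, 0.4
h[30], h[35] = 0.3, 0.1
qs = quality_score(h, n_side=2, half_window=2)
print(round(qs, 4))
```

In the toy input the side values average 0.25, so a mid-point peak of 1.0 scores about 4. A spectrum whose mid-point peak is no higher than its side peaks would score close to 1.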