Transformation and other factors of the peptide mass spectrometry pairwise peak-list comparison process

Background: Biological Mass Spectrometry is used to analyse peptides and proteins. A mass spectrum generates a list of measured mass to charge ratios and intensities of ionised peptides, which is called a peak-list. In order to classify the underlying amino acid sequence, the acquired spectra are usually compared with synthetic ones. Development of suitable methods of direct peak-list comparison may be advantageous for many applications. Results: The pairwise peak-list comparison is a multistage process composed of matching of peaks embedded in two peak-lists, normalisation, scaling of peak intensities and dissimilarity measures. In our analysis, we focused on binary and intensity based measures. We have modified the measures in order to comprise the mass spectrometry specific properties of mass measurement accuracy and non-matching peaks. We compared the labelling of peak-list pairs, obtained using different factors of the pairwise peak-list comparison, as being the same or different to those determined by sequence database searches. In order to elucidate how these factors influence the peak-list comparison we adopted an analysis of variance type method with the partial area under the ROC curve as a dependent variable. Conclusion: The analysis of variance provides insight into the relevance of various factors influencing the outcome of the pairwise peak-list comparison. For large MS/MS and PMF data sets the outcome of ANOVA analysis was consistent, providing a strong indication that the results presented here might be valid for many various types of peptide mass measurements.


Background
In recent years, mass spectrometry (MS) has emerged as a powerful technique to identify proteins in biological samples [1][2][3][4]. For their identification, proteins are usually cleaved into peptides by a protease of known and restricted cleavage specificity, e.g. trypsin. The resulting cleavage products can then be analysed by Peptide Mass Fingerprinting (PMF) [5] or subjected to MS/MS fragment ion analysis [6,7]. A PMF is a highly specific set of peptide molecular masses derived from one isolated protein.
PMFs are employed to identify the analysed protein in large protein sequence databases by matching the determined peptide molecular masses to values calculated from the amino acid sequences in the database. Similarly, MS/MS spectra serve for protein identification by comparing the determined peptide fragment ion masses against predicted ones from amino acid sequence data and fragmentation characteristics of the employed MS instrumentation [8].
Before performing database searches, the MS spectra are processed and the most informative features, namely the monoisotopic peaks are extracted. The procedure consists of several steps and includes smoothing, baseline subtraction, peak-extraction and monoisotopic peak determination [9,10]. A spectrum pre-processing usually requires proprietary software provided by the instrument vendors. It generates a list of mass over charge (m/z) values of the monoisotopic peaks and either the area under or the height of those peaks are obtained. The set of m/z and intensity value pairs is called a peak-list. In case of PMF datasets, the peak-lists have on average 36 peak/intensity pairs compared to e.g. 100,000 data points of the unprocessed spectra.
The sensitivity and specificity of the peptide identification using database searches might be increased by several methods. These usually include calibration [11][12][13], identification of non-peptide peaks [12,14,15], identification and removal of low-quality spectra [16,17] or validation of the search results using machine-learning algorithms [18,19].

The subtractive analysis technique
The sensitivity and specificity of peptide and protein identification can further be increased by the pairwise comparison of the peak-lists [11,[20][21][22]. Yates et al. [20] applied, as a measure of peak-list similarity, the cross-correlation score normalised by the auto-correlation of the spectra. They demonstrated that when using this measure, the MS spectra could be correctly classified according to their peptide content even if acquired on two different instruments, namely a Triple-Quadrupole Tandem (TSQ) or Quadrupole Ion Trap (LCQ) mass spectrometers.
They suggested, as part of a "subtractive analysis technique", using pairwise spectra comparisons to search MS spectra against a library of identified spectra before database searching. Tandem mass spectra (unique to an experiment) could be targeted for database searches or de novo interpretation.
A significant portion of identical peptides is analysed and identified many times even when the instrument control software attempts to prevent the repeated isolation and fragmentation of particular peptides in order to increase the diversity of acquired spectra. Gentzel et al. [11] used the cross-correlation measure for MS/MS spectra comparison. They computed the similarity score for two parts of the spectra. If these parts exhibited a satisfying similarity score, the spectra were assumed to be identical. Tabb et al. [22] explored the performance of the normalised dotproduct (spectral angle) algorithm to identify duplicated samples. The advantage of the dot-product measure over the cross-correlation algorithm lies in its computational speed. Based on this measure, Tabb et al. [22] and Beer et al. [21] developed software to identify the duplicated MS/ MS spectra. The analysis time saved by subtractive analysis can be used instead to perform more extensive searches in other databases, i.e. expressed sequence tag (EST) databases, or to apply computationally demanding, mutationtolerant search algorithms [23], which depend on partial spectra interpretation [24][25][26].
Pairwise spectra comparison can also be put to use "as an informative marker to identify organisms or some other feature of an organism" [20]. For example, Svetnik and Liaw [27] used pairwise spectra comparison to detect novel outliers in large-scale cosmid screening experiments [28]. They used the Pearson and Spearman correlation measures, as well as the Euclidean distance to compute the distances of the spectra, followed by sequential clustering. Serum protein and peptide fingerprints were used in diagnostic medicine to distinguish healthy individuals from those with cancer [29][30][31][32].

The pairwise peak-list comparison process
Previously, only the performance of the dot-product measure has been compared to the similarity index using MS/MS spectra of structural isomers [33]. Therefore, in our work we have reviewed a large group of dissimilarity measures and examined how these can be extended to include the mass spectrometry specific property of mass measurement accuracy. A new parameter weight of nonmatching peaks (θ) was introduced into the computation of distance measures. We have studied the Euclidean and the Manhattan distance, the covariance, the sum of agreeing intensities and the spectral angle. We have also examined the impact of the intensity scaling on the outcome of intensity based measures [34,35]. In addition, we have performed a systematic study of various intensity transformations [22] in order to determine the best variance stabilising transformation. Furthermore, we investigated quantitative measures, i.e. Huberts Γ or the relative mutual information measure [36]. The combination of these factors resulted in 96 choices of the comparison process for the binary measures and 2688 approaches for the intensity based measures. The first aim of the work presented here was to determine the pairwise peak-list comparison approach with highest sensitivity and specificity for the grouping of spectra. The second goal was to determine which factors studied had the highest effect on the outcome of the clustering, in order to foster the understanding of the pairwise peak-list comparison process. While the first goal could be easily achieved by ranking the various peak-list comparison approaches, the second goal was approached by analysis of variance (ANOVA) techniques. The partial area under the receiver operator characteristic (ROC) curve, determined for high sensitivity and specificity values was used as the dependent variable, while the various choices for the comparison process were the factors in the ANOVA.

Evaluation framework
PMF and MS/MS data represent mass spectrometric peptide peak-lists. While the PMF measurement is characterised by high-mass resolution, large mass range and the production of relatively few peaks, the MS/MS spectra have a lower-mass resolution, a smaller mass range and a higher number of peaks. Figure 1 presents examples of peak-lists for fragment ion MS/MS and PMF, respectively. We have analysed the pairwise comparison process using both datasets in order to determine if the differences of the data require different configuration of the pairwise comparison.
In order to determine the sensitivity and specificity of the measures used for classification, the grouping induced by the measures must be compared with the true cluster membership of the spectra. However, we did not have a dataset with a large number of groups of spectra and with identities known a-priori.
Therefore, in our study we used PMF and MS/MS spectra resulting from studies of different proteomes. The identities of the spectra were determined by database searches (cf. Methods). We assumed that the database searches resulted in true identification of the peptides and proteins.
In the PMF dataset, only 176 proteins out of 668 were identified by a single spectrum, while the remaining 492 protein identifications resulted from 2160 database searches. The amount of duplicated samples in the two datasets (Table 1) was significant and recognising them could for example, significantly reduce the number of searches necessary to identify all proteins.
According to the protein database identifier (ID), in case of the PMF data or the peptide sequence and the same parent ion charge z = 2 for MS/MS data, we defined a peaklists pair (X, Y) to be within a cluster if it was assigned to the same protein ID or peptide sequence. Similarly, two peak-lists were defined to lie between the clusters if their database IDs differed. This assignment of the peak-lists pairs as within and between was recognised as the true condition status. The assignment induced by the pairwise comparison approach for a given thresholds of the discriminatory variable was compared with this true condition status and the sensitivities and specificities were computed.
The spectra in the large group of unidentified peak-lists could not be used in this study. This is because we could not infer that all these spectra were derived from the same peptide/protein. Secondly, we could not assume that an un-identified spectrum was not obtained in a measurement of the same peptide as any of the spectra assigned to a database identifier or peptide sequence. The identification of a protein/peptide often fails because of a signal to noise ratio that is too small. If we would treat this group likewise the clusters formed by identified spectra we would then introduce an error during the determination of false positive/false negative rates.
Example of a peak-list stick spectrum for fragment ion MS/MS (top panel) and PMF(bottom panel) Figure 1 Example of a peak-list stick spectrum for fragment ion MS/MS (top panel) and PMF(bottom panel). X-axis -mass of the peaks, Y-axis -area under the peak.

MSMS
Using the data, where the identities of samples were determined by database search algorithms we were able to examine whether the pairwise peak-list comparison made equal or different assignments to a group, in relation to the database search algorithm. We were not able to disclose if any of these measures had higher sensitivities and specificities than the database search algorithms used. However, these data were sufficient to expose relative differences between pairwise peak-list comparison approaches, as well as the degree by which the various factors of the comparison process influenced the outcome.
The computation of the pairwise peak-list distances was performed using the in-house developed R [37] package msbase, which is available from the BioConductor Project [38] web page [39].

Results and Discussion
The factors of the pairwise peak-list comparison Table 2 summarises the factors, which can influence the outcome of a pairwise peak-list comparison. The first step in the comparison of the peak-lists is to determine matching and non-matching peaks with given mass measurement accuracy. If one peak is ambiguously assigned to several peaks in the second peak-list (cf. Methods - Figure  6), the non-crossing matching can be computed. The next element to be considered is whether the mass measurement accuracy should be modelled [45,46] using Equation 2. Modelling of the mass differences between matching masses did not affect the non-matching peaks. The influence of non-matching peaks on the pairwise peak-list comparison was modulated by increasing or decreasing their weight by two-fold (using the parameter θ in the dissimilarity equations (cf. Methods)).
The length of the aligned peak-lists either equals the sum of the peaks in both peak-lists minus the number of peaks matching or is user-defined. In Equation 3, we set N = 250 for the PMF dataset and N = 400 for the MS/MS dataset, which in both cases was approximately twice the length of the longest peak-list. In case of the intensity based measures the missing peak pairs were augmented by peaks of zero intensity. Further elements which affected only the intensity-based measures included the transformation and scaling of peak intensities. The distance measures (cf. Methods) were the last of the examined factors. Section 'Features of the pairwise peak-list comparison and their properties' in the Appendix section provides a descriptive analysis of the features of the pairwise peak-list comparison in order to introduce the data and to motivate the use of various factors of the peak-list comparison approach.
The intensity-based measures contained 2688 sets of factors while the binary measures covered 96 sets. To determine which of these factors were important and how they influenced the scores, we applied analysis of variance techniques (ANOVA). We first performed the ANOVA on the PMF dataset. Afterwards, we examined if the obtained linear model could be used to explain the properties of the pairwise peak-list comparison process computed on the MS/MS dataset.

The evaluation scores
In order to evaluate the capability of pairwise comparison approaches to identify peak-list pairs as being within or between cluster we used the partial area of interest under the Receiver Operating Characteristic (ROC) curve (PAUC) [44]. The ROC curve was generated by drawing the  Figure 2. For 4 matches we determined a specificity of 99% and sensitivity of 95%.
We were particularly interested in the sensitivity of the pairwise peak-list comparison only for small values of the FP-rate. Therefore, we computed the Partial Area under the ROC curve (PAUC) for 1 -specificity ∈ [0, 0.1] (red-dashed region Figure 2), denoted by sensitivity-PAUC. Moreover, we were also interested in the specificities when high sen- Table 1: Number of clusters of given cluster size N. The columns 2 and 3 describe the cluster size in the PMF-and the MS/MS datasets. Number of spectra -number of peak-lists submitted for database search, identified spectra -spectra assigned to a database ID with an either significant probability based Mowse score (PMF-data) or to a peptide sequence with Xcorr > 2, and an ion coverage > 20% (MS/MS-data) given a parent peptide charge z = 2. Identified proteins/peptides -the number of uniquely identified proteins or peptides. A -approximate number of spectra derived from ion fragments of peptides with charge z = 2. B -The number of spectra with charge z = 2 of the parent ion (≈ 53% of all identified spectra). sitivities are required (sensitivity ∈ [0.9, 1]), further abbreviated specificity-PAUC. Hence, we computed the PAUC for the area indicated in Figure (2) by the green-dashed region. Both sensitivity-PAUC and specificity-PAUC were utilised as the dependent variable in the analysis of variance.

ANOVA of the pairwise peak-list comparison approaches
The aim of the statistical analysis was to evaluate different strategies of the pairwise peak-list comparison with respect to the sensitivity and specificity partial area under the ROC curve (PAUC). A possible strategy to decide which pairwise comparison approach performs best would have been to choose the one with the largest partial area under the curve. However, we were also interested in determining the influence of and the dependence between the factors of the pairwise peak-list comparison on the outcome of the classification.
Each strategy of the pairwise peak-list comparison was defined by the combination of seven factors as given in Table 2. The whole set of pairwise peak-list comparison strategies shows a completely balanced factorial structure analogous to those used in analysis of variance (ANOVA) [42] with PAUCs as dependent variables and the specific strategies of pairwise peak-list comparison as combinations of factor levels. However, due to multi-modality our data could not be transformed to approximate normality. Thus, we could not calculate F-ratios and related statistical tests of significance. To assess the significance of the factors, we therefore used the relative sum of squares (%SSQ) and the relative mean sum of squares (%MSQ), defined as the ratio of the SSQ or MSQ with respect to the total SSQ or MSQ. We did not calculate F-ratios or P-values for factors and interactions, due to the mentioned deviation from normality. A large value of the relative sums of squares (%SSQ) for a factor in Tables 3 and 4 indicates its importance for the correct classification of peak-lists.

The ANOVA results
The high value of the %SSQ or %MSQ value in Table 3 and 4 reflect the change (variance) of the response variable PAUC caused by a factor or combination of factors. In case of the intensity based measures, the high value of the relative mean sum of squares (%MSQ) ( Table 4, top panel) for the factor 'scale' (intensity scaling procedure) and 'measure' (dissimilarity measure) indicates that these factors were crucial for the correct classification of peaklists. The small values of the %MSQ for the factors 'weight of match accuracy' (weight) and 'computing the noncrossing matching' (noncross) shows that these factors had a negligible impact on the result of pairwise peak-list comparison.
A large value of %MSQ or %SSQ of an interaction term (denoted by × in Table 3 and 4, bottom panel) demonstrates that some combinations of factors were more useful than others. The high value of %SSQ for the interaction measure × scale reflects that, for example, the measure sum of agreeing intensities performed better in combination with the vector length scaling (N) or rootmean-square scaling (S), than with the total ion count scaling (T) or with the z-score scaling (Z). We concluded that the crucial factors of the pairwise peak-list comparison were the measure and peak intensity scaling, followed  by the weight of non-matching peaks and the length of the peak-list.
In case of binary measures, the high %MSQ value ( Table  3, column %MSQ) of factors 'measure', 'weight of nonmatching peaks' θ, 'peak-list length' N as well as of their interactions indicates that their are crucial for the outcome of the pairwise comparison.
In order to examine the extend to which the properties of the pairwise peak-list comparison, determined for the PMF data set can be generalised to other types of mass spectrometric data, we applied the ANOVA to the MS/MS dataset (Tables 3 and 4, right panel). The main difference between these two datasets is that the computation of the non-crossing matching which influenced the PAUC scores in case of MS/MS data, (Tables 3 and 4, row noncross). However, one can conclude that the same factors and factor interactions are significant if comparing PMF and MS/ MS data.

Dissimilarity measures with small variance and high PAUC scores
The Figure 3 shows the boxplot of the sensitivity PAUCs measure (PMF data) itemised according to the factors explaining the largest variance. These included measure and weight of non matching peaks θ (binary measures, Figure 3A) and measure and scaling (intensity based measures, Figure 3B).
In case of binary measures ( Figure 3A) the largest PAUC scores were measured for the Fowlkes-Mallows statistics ( Figure 3, left panel) followed by the Gower coefficient.
Other factors had a negligible impact on the measures as the small height of the boxes indicates. In Figure 4A (PMF data) and 4B (MS/MS data) we compared the scores obtained by the asymmetric binary measures (Fowlkes Mallows statistics and Gower coefficient) with those acquired by the symmetric binary measures (Huberts Gamma and Relative Mutual information). The figure revealed that for MS/MS data, the symmetric measures performed better (right panel). The conclusion, which can be drawn from this observation is that for MS/MS data a lack of peaks at given masses was more significant than for PMF data. This is in agreement with the higher peak density of MS/MS peak-lists.
The boxplot ( Figure 3B) demonstrates that the PAUC scores computed using the Manhattan and Euclidean distances, exhibited higher overall variance, than the dot product measure and the sum of agreeing intensities. These measures (Manhattan and Euclidean distances) were influenced by the weighting of non-matching peaks θ and peak-list pair length N (cf. Methods, Equation 3).
These two factors did not influence the outcome of the dot-product measure, which measures only the similarity of matching peaks. However, for a fixed combination of measure and scaling, the Manhattan distance with the total ion count (l 1 -norm) scaling and the Euclidean distance with the vector norm (l 2 -norm) scaling presented an eminently small variance of the PAUC measure. This reduction of the variance occurred because the factor 'peak-list length' did not influence the outcome of the comparison. Notably good choices of intensity scaling in the case of the sum of agreeing intensities (soai) and the dot product (dotprod) measure were either the vector norm or the root mean square scaling. Remarkably, the widely used dot product measure did not achieved the highest PAUC score (top panel Figure 3B).
The analysis presented here reproduced published results, demonstrating that the relative distances (cf. Methods 12) performed worse than the dot product measure [33]. Furthermore, it identified other measures that performed equally or better than the dot product measure.
Receiver Operator Characteristic curve -The sensitivity (TP-rate) is plotted against FP = 1 -specificity using the number of matching peaks as the discriminatory variable

Intensity transformation and ANOVA
As the %MSQ scores of the ANOVA analysis reveal, the intensity transformation has a smaller impact on the peak-list comparison process if set in relation with the distance measure and scaling. However, the proper choice of the intensity transformation can increase the PAUC scores. The Boxplot ( Figure 5) of the sensitivity-PAUC (left) and specificity-PAUC (right) score, computed using the dot-product and the sum of agreeing intensities measures (both computed on vector length scaled data), shows how the intensity transformation influenced the classification. As predicted by the analysis of variance stabilisation (cf. Appendix -Peak intensity transformation), the log transformation of intensities performed best for both measures.
Interestingly, in the case of MS/MS data we did not observe any differences in the PAUC score due to intensity transformation (not shown). This was due to the fact that we were able only, using a dataset where spectra IDs where assigned by database searches, to determine whether a pairwise comparison performed different or equal to the database search algorithms. It means, that the MS/MS database searches did not perform better than the pairwise peak-list comparison computed with the worst intensity transformation.

The peak-list length
We examined two ways of defining the length of the matched peak-lists, first by setting N = 0 in Equation 3 (cf. Methods), and second to a user defined value N (N = 250 in the case of PMF data and N = 400 in case of MS/MS   In case of the intensity based dissimilarity measures Manhattan and Euclidean distance, a strong interaction between the factor 'peak-list length' and 'measure' ( Table  4, bottom panel: Final model row length × measure) was observed, except for a case when the L 1 metric (Manhattan distance) was combined with the total ion count (l = 1norm) scaling and the L 2 -metric (Euclidean distance) with the l 2 -norm (vector length scaling). All the other intensity based measures were practically not influenced by the choice of peak-list length N (see Figure 3).

Differences between binary and intensity based dissimilarities
The dash-dotted line in Figure 3 indicates the maximal sensitivity-PAUC determined for the binary based peaklist comparison, while the dashed line shows the maximal sensitivity-PAUC computed for the intensity based peaklist comparison. If high sensitivities at a high specificity (sensitivity-PAUC) were required, the intensity based peak-list comparison performed better than the binary based peak-list comparison. This is because it is very unlikely that samples from different sources would generate spectra where not only the peak masses, but also the peak intensities were similar. However, if high specificity at high sensitivities was required (PMF data only, not A: Boxplot of the sensitivity-PAUC (sensitivity given a FP-rate ∈ [0, 0.1]) itemised according the factors dissimilarity measure and θ (weighting of non-matching peaks) for the binary measure based peak-list comparisons

S−PAUC
shown), the order reversed and binary measures performed better than intensity based measures. Using binary coding makes it unlikely that peak-lists with matching peaks will generate a large distance because of erroneous peak intensity measurement.

Weighting of mass measurement accuracy, computing the non-crossing matching and weighting of non-matching peaks
The variance, which is explained by the factor 'non-crossing matching' (cf. Methods -Finding the matching peaks) and 'weight of matching peaks' (cf. Methods -Weighting the missing mass measurement accuracy) was practically zero in case of the PMF data. In case of the MS/MS data, the variance explained by the factor 'weight of matching peaks' and 'non-crossing matching' was small, but not zero.
Using the measures, which take into account the nonmatching peaks (e.g. Euclidean distance, Huberts Γ), weighting of mass accuracy may decrease the PAUC obtained by a peak-list comparison. This is because weighting of mass accuracy decreases the weight of match-ing peak-pairs, but does not affect the non-matching peaks. For example, in the case of the Euclidean distance, which is given by weighting of match accuracy may decrease the weight of the term ab as well as b 2 and a 2 . For non-matching peaks the term ab equals zero, however, the terms a 2 and b 2 have a full weight. However, matching peaks have higher discriminating power than non-matching peaks. Thus, exclusively decreasing the contribution of matching peaks decreases the discriminating power of a pairwise peak-list comparison. In order to compensate for the effect of mass measurement accuracy weighting we have introduced the weighting of non-matching peaks by parameter θ.
We recommend the usage of both procedures if applying the pairwise peak-list comparison on MS/MS data. Noncrossing matching corrects for errors of the peak extraction procedure. Alternatively, instead of computing the noncrossing matching, the binning of the mass range as described, for example, by Tabb et al. [22] can be used. This however, is limited to data with small mass resolution and a mass range. To decrease the influence of random matches on the dissimilarity, which more frequently occurs in MS/MS peak-lists, the weighting of mass measurement accuracy can be utilised.

Conclusion
Analysis of variance, based on the factorial structure presented in Table 2 and the PAUC as dependent variable, was used to determine the sensitivities of the factors of pairwise peak-list comparison. To test whether these results apply to different types of mass spectrometric data we used both PMF and MS/MS datasets. The amount of variance explained by the factors was similar for both datasets, which provides evidence that the obtained results might be of general interest.
Two factors, namely measure and intensity scaling and their interactions had the highest impact on the intensity based pairwise peak-list comparison. The combination of the Euclidean distance with vector norm scaling, the Manhattan distance with total ion count scaling and the sum of agreeing intensities with vector length scaling were the best performing measures. A high performing measure with small variance due to the choice of scaling methods was the dot product measure. A further factor, which can be used to increase the classification performance of the peak-list comparison is the intensity transformation with the log function as a best choice. In case of the MS/MS data we recommend to apply the weighting of mass measurement accuracy and combine it with a decrease of the weight of non-matching peaks (θ = 0.5), as well as to implement the computation of non-crossing matching. The most important factors for the comparison of the peak-lists using binary measures are the measure, weight of non-matching peaks (θ) and peak-list length N. Symmetric measures with large peak-list length N and a small weight of non-matching peaks (θ = 0.5) performed best for MS/MS data, while asymmetric measures were the most useful during a comparison of PMF data.
A further possible direction to enhance measures of pairwise peak-list dissimilarity would be to combine them with methods that model peak-list properties i.e. peptide fragmentation patterns [17].
The pairwise peak-list comparison is a computer model of cluster affiliation, which for a given input of two peak-lists and various control variables such as weight of nonmatching peaks generates a single output variable further used to classify the peak-list pair. To conduct the analysis presented here a total number of ≈ 4,000,000,000 pairwise peak-lists comparisons was performed (cf. Methods Computation). In order to reduce a number of required computations and explore a wider range of factors and levels, it might be beneficial to apply different methods to design and analyse computer experiments [47].
The recommended pairwise peak-list comparison approaches can be used as predictive functions of within and between cluster associations of mass spectrometric peak-list pairs. However, the best value of the discriminatory variable still needs to be determined. This can be achieved, for example, by the use of ROC curves combined with cross validation analysis, but will require a dataset where the identities of the peak-lists are known apriori.

Datasets and pre-processing PMF-data
The PMF data employed in this study (4532 PMF MS spectra) were derived from three proteome studies.  [48,49]. All PMF MS spectra were derived from tryptic protein digests of individually excised protein spots. For this purpose the whole tissue/cell protein extracts of the former mentioned organisms were separated by two-dimensional (2D) gel electrophoresis [48] and visualised with MS compatible Coomassie brilliant blue G250 [49]. The MALDI-TOF MS analysis was performed using delayed ion extraction and employing the MALDI AnchorChipTM targets (Bruker Daltonics, Bremen, Germany). For positively charged ions in the m/z range of 700 -4,500 m/z were recorded. Subsequently, the monoisotopic masses of the measured peptides were detected by the SNAP algorithm of the XTOF spectra analysis software (Bruker Daltonics, Bremen, Germany). The sum of the detected monoisotopic masses constitute the raw peak-list (peak-list), which were calibrated to a mass accuracy of 0.05Da (or higher) by the in-house developed software mscalib [39,50]. Moreover, the mscalib software was used to filter the peak-lists for irregular peaks that did not follow the general peptide mass rule [12,51]. Additional background peaks (peaks occurring in more than 8% of the spectra [15]) were removed from the peak-lists. The obtained processed peak-lists were then used for the protein database searches with the Mascot search software (Version 1.8.1) [52] employing a mass accuracy of ± 0.1Da, setting methionine oxidation as a variable and carbamidomethylation of cysteine residues as fixed modification, and allowing a maximum of 1 missed proteolytic cleavage site. Samples with multiple content/identifica-A: Boxplot of the specificity-PAUC (specificity given a TP-rate ∈ [0.9, 1]) for the dot-product measure (dotprod) and sum of agreeing intensities (soai) Figure 5 A: Boxplot of the specificity-PAUC (specificity given a TPrate ∈ [0.9, 1]) for the dot-product measure (dotprod) and sum of agreeing intensities (soai). B Boxplot of the sensitivity-PAUC (sensitivity given a FP-rate ∈ [0, 0.1]). N -raw intensities, S -square root transformed intensities, L -log transformed intensities, R -intensity ranks.

MS/MS data
To evaluate the distances for the MS/MS data, 70 clusters (spectra assigned to one ID) were randomly chosen (5 replicates obtained) from a large data-set of identified yeast spectra [53]. The protein extraction, sample preparation, measurement and identification were performed as described by Wagner et al. [54]. The analysed MS/MS spectra were recorded on an ESI Ion Trap mass spectrometer (LCQ DECA, Thermo Electron) with the following instrument settings: spray voltage: 1.5 kVolt; data dependent scanning with one full MS spectrum is followed by four independent MS/MS spectra of the four most intensive ions; minimum signal intensity for a peptide to be selected for fragmentation set to 10 6 ion counts. These selected and fragmented ions were then excluded from further fragmentation events for 1 minute to prevent repeated MS/MS spectra of identical peptides. The collision energy for the peptide fragmentation was automatically set by the instrument, which was controlled by the Xcalibur software (Versionl.2, Thermo Electron). The post acquisition processing was performed with the Bioworks browser package (Thermo Electron). The resulting peaklists were automatically stored and assigned to peptide sequences in the yeast protein database [55], by using the Sequest database search algorithm (Version 27) [8,56].
The search parameters employed for the database searches were as follows: a) none or one of the four proteases, as defined by Sickmann et al. [53]; b) mass type: mono isotopic (parent ion and fragment ion) c) amino acid modifications: carbamidomythylated cysteine residues +57Da and oxidation on methionine residues +16Da, while missed cleavage sites (maximum allowed): 1 missed. We considered spectra identified if they had an Xcorr > 2 and an ion coverage of 20%.

Finding the matching peaks
We considered two peaks x and y from different peak-lists X, Y to match with an accuracy a if |x -y| <a (absolute error) or (relative error in part per million (ppm)). Cases where more than one peak in Y match a peak x (Figure 6, case A) were resolved by computing a non-crossing matching of the peak-lists. A non-crossing matching a maximum trace [57,58] can be computed in time O(n log n), where n is the number of peak matches. In our case, we resorted to a simple maximum similarity alignment, which could be banded to improve its O(n 2 ) time complexity. The optimal trace of two sorted mass lists of matching peaks was established by dynamic programming. Let qual be a measure of the goodness of the match of two peaks i.e. qual abs = max{0, a -|x -y|}, x y x y a − + ⋅ < 2 10 6 Stick spectrum of two peak-lists X (black lines) and Y (black dot dashed lines) Figure 6 Stick spectrum of two peak-lists X (black lines) and Y (black dot dashed lines). Upper left corner -accuracy of the mass measurement a. A -ambiguous match of five peaks. B -unambiguous match of two peaks. C -peaks not matching.
a where x and y are masses of peaks from two distinct peaklists. Then our goal was to maximise the overall quality of all matched peaks. We recorded in the matrix M i, j the best possible assignment of the first i peaks in the first list and the first j peaks in the second list. Hence M i,0 = M j,0 = 0 for all i, j, and it is easy to see that the following recurrence can be used to derive the overall best assignment: In this way we could find a non-crossing matching, which minimised the overall errors and unambiguously assigned a peak x to a peak y.

Weighting the missing mass measurement accuracy
For computation of the dissimilarities we used the weighting of the mass measurement accuracy [45,46] and the alignment of peak-list by linear regression [12,59]. To model the accuracy of a given match either we weighted the peak intensities in the matching pairs (intensity based measures) or calculated the weight of the match (binary measure) by a triangular function w: where a is the maximum displacement evaluated and x and y are peak masses. If the mass difference |x -y| of two matching peaks increases then the significance of the match is reduced. Before computing the weights, we minimised the overall error of the matching masses by adjusting the two peak-lists using linear regression.

Non matching peak pairs
Peaks detected in one sample not occurring in the other one ( Figure 6, case C), were included in the computation of the dissimilarities. The second peak-list was augmented with a peak of zero intensity, at the mass of the not-matching peak. In addition, the significance of such a peak pair (and peak intensities) could be weighted with the factor θ.
In this study we examined three values of θ, namely 0.5, 1 and 2.

Binary measures
We have also investigated measures that only use qualitative information in the sense that they evaluate the number of matching and mismatching peaks of both peak-lists. Essentially these measures are numerical functions in the contingency Table 5 derived from both peaklists. To include the weighting of missing accuracy by the w (see Equation 2) and the weighting of non-matching peaks by θ we introduced a generalised version of the contingency table. All binary measures introduced below can be computed on the entries of the contingency Table 5.
Peaks present in list X, but not in list Y, are denoted by , likewise present in Y, but not in X by . We multiplied the mismatches by θ to assign a variable weight. Therefore, , as well as were replaced by and , respectively. To include the weighting of missing mass accuracy in computing the dissimilarities one can set , with defined by the Equation 2. Our data are asymmetric in the sense that we can only evaluate existing peaks and do not count the absence of peaks in both peak-lists at a mass. Measures that utilise only this information are the Gower coefficient and Fowlkes-Mallows statistics. Additionally, we were interested in the performance of measures that take into account the marginal M and hence the entry is required (Hubert's Γ (Appendix Equation 16) or the relative mutual information (Appendix Equation 19)).
Since the peak-lists can have different length and the maximal peak-list length is undefined, we defined the entry M length of a matched peak-lists pairs as follows: where N is an arbitrary user defined constant and c = 1 in case of the Huberts Γ and c = 0 otherwise. By this definition, due to the use of the maximum function we avoided the case when becomes less than zero (see equation 4 for definition of ). In this study we used two different values of N. We set N = 0 and the second value equal to twice the length of the longest peak-list in each dataset.
Given all entries of the modified contingency Table (5), the marginals could be computed by equations (4 -8): A summary of the measures studied here can be found in the the Appendix section.

Measures based on peak intensities and intensity ranks
Before computing the measures based on peak intensities, we have applied intensity transformations and scaling procedures. The peak intensities were transformed by taking the square root, as suggested by Tabb et al. [22], and the logarithm. Furthermore, we replaced the intensities by their ranks within the peak-list [27]. The scaling procedures used included total ion current count normalisation [34], vector length normalisation, root mean square normalisation and z-score normalisation (a detailed description can be found in the Appendix section).
In our study, we investigated several pairwise similarity measures to compare two peak-lists. These measures are either measures of similarity (such as covariance) or measures of distance (such as the l p metrics). In order to make both classes of measures comparable we transformed each similarity measure into an appropriate dissimilarity measure. Moreover, we introduced the factor w i to weight missing mass measurement accuracy (cf. Methods -Weighting the missing mass measurement accuracy) and non-matching peaks (cf. Methods -Non matching peak pairs).

The dot-product of two vectors is defined
where I X and I Y are the intensity vectors of two matched peak-lists (cf. Methods -Finding the matching peaks) of length N, and is defined by the Equation 2 for matching peaks and equals θ for non-matching peaks. In case of sum-mean-square, total ion count and vector length scaling, the product of non matching peak-pairs is zero and therefore this measure is independent of θ. If the intensities of the matched peak-lists are z-score scaled the outcome will depend on the value of θ. Furthermore, augmenting the peak-lists by zero pairs in order to increase their length will increase DP for z-score and rootmean-square scaled data. The most prominent represent-ative of this family is the spectral angle (the dot-product of vector length normalised data). It has a geometric interpretation. It is equal to the cosine of the angle enclosed by the two vectors.

Covariance
The covariance is a measure of dependency between random variables I x and I y [61] and is defined as: where I X , I Y , N, are defined as above.
The best known representative of this family of measures is the Pearson correlation, which is obtained if we compute the covariance of z-score scaled intensity vectors.

Metric-based measures
The Euclidean and Manhattan distances belong to the family of l p metrics and can be expressed using equation: In case of the Euclidean distance p = 2, and for the Manhattan distance p = 1. The Euclidean distance penalises large intensity differences mores than the Manhattan distance. The outcome of this measure will change due to different sample wise scaling of the intensities. In case of the z-score scaling the outcome will depend on the user defined peak-list length N (Equation 3).

Similarity index and Canberra distance
The Similarity index [33] and Canberra Distance [62] measure the relative distance and can be expressed by the equation: DP I I w I I   Setting p = 2 yields the similarity index, while p = 1 results in the Canberra distance. Similarly, as in case of the l p metrics, the similarity index with p = 2 will be more influenced by large intensity differences than the Canberra distance.
If the term x + y in the denominator equals zero due to x = -y, with x ≠ 0 ∨ y ≠ 0, infinity +∞ is returned [60].

Sum of agreeing intensities
The sum of agreeing intensities is defined by the equation: It shares with the similarity index and the Canberra distance the property that each pair of matching peaks will contribute to the final score a proportion in the range of [0,1/n]. The sum of agreeing intensities however, puts more emphasis on the agreement of peak intensities. Peak pairs whose intensity differences are larger than their average intensity receive a weight of zero.

Computation
All scores presented in the results section were computed for 75 clusters. The clusters were sampled from the datasets without replacement. For each cluster we randomly chose 2 -20 (PMF-data) 2 -7 (MS/MS) samples. This procedure was repeated five times and the average of the scores was computed. The pairwise peak-list comparison approaches were computed with a mass measurement error of 0.7Da for the MS/MS data, and of 0.2Da for the PMF data. The PAUC areas were computed using in-house developed R functions. Other R packages provide a huge variety of statistical tools for further analysis of the dissimilarities such as clustering algorithms and validation or multidimensional scaling methods [67].

Binary measures
Jaccard/Gower coefficient The matching peak count is the dot-product of the two peak-lists and counts the number of matching peaks ( ). Since peak-lists have different numbers of nonzero elements, this dot product must be normalised by the total counts. The Jaccard coefficient is a normalised version of the matching peak count, whose distance version is given by: A generalised version of the Jaccard coefficient in which and is weighted by a constant θ was introduced by Gower et al. [63].

Fowlkes-Mallows statistics
The Fowlkes-Mallows statistics [64] (introduced in the context of clustering validation by use of contingency tables) are the matching peak counts normalised by the geometric mean of the peak-lists lengths. The equation of the distance-like version is given by:    We observed that the numerator was maximised if all signals were expressed equally. To avoid the fact that the denominator becomes zero (which is the case if or = 0 and occurs if one peak-list is included in the other) we set c = 1 in equation (3).

Relative mutual information
We were additionally interested in the performance of information theoretic concepts. Given the two peak-lists, X and Y, the amount of information about peak-list X inherent in peak-list Y (and vice versa) is given by the mutual information (H) [65]: To be able to use the mutual information as a similarity measure, so it could distinguish positive from negative correlation, we introduced the following scaling term [66]: Furthermore, we adjusted it for the information inherent in the individual peak-lists. The adjustment was done using the entropy of the individual peak-lists, which for a peak-list X is given by: Thus, we defined the relative mutual information: The relative mutual information is small if both peak-lists are similar and high if they differ. The scaling is preferred if intensities and variance in an arbitrary sample are much higher than in the other samples, which will determine the outcome of the peak-list comparisons. Data transformation was applied before the peak-list matching, whereas data scaling was performed for already matched peak-lists.

Features of the pairwise peak-list comparison and their properties
The number of matches While the intensities of individual peaks may considerably vary between the spectra, the m/z values of fragment ions can be measured with at least the accuracy of a single m/z in the majority of mass spectrometers. If the primary fragment ions/peptides in a pair of spectra have the same m/z locations, the spectra are judged to result from the same peptide/protein, regardless of their peak intensities. Table 6 (rows one and four) summarises the properties of both: mass measurement error (MME) and the mass measurement range of the peak-lists. It also provides the five-number summary and the mean of the peak-lists lengths. The observed number of matches for within and between cluster peak-list pairs is shown in Table 6 (rows two, three, five and six). The theoretical probability of i matches if two independent peak-lists of known length, mass measurement range and resolution are compared, can be modelled using the hyper-geometric distribution [40]. In case of PMF data, for peak-lists drawn from between clusters, a higher number of matches than expected for independent peak-lists was observed. This difference might be due to incomplete separation of proteins obtained after two-dimensional (2D) gel electrophoresis [41] and because the sequence database entries have different database IDs, even if the protein sequences exhibit a high fraction of sequence identity (i.e. protein families).
For 75 clusters of various size (2 -20 samples/cluster) sampled five times from the PMF dataset, we have computed the number of matching peaks for all peak-list pairs. The number of matching peaks was in almost all cases higher, if the peak-lists compared laid within one cluster (magenta histogram, Figure 7A), than if they occurred between different clusters (green histogram, Figure 7A). For example, 95% of within cluster peak-list pairs had more than 4 matches, but only 1% of between cluster peak-list pairs had more than 4 matches. The cases where the number of matches between peak-list from within one cluster equalled zero, can be explained by the fact that the A -Histogram of the number (bandwith = 1) of matching peaks for peak-lists chosen from the same cluster (magenta) and from different clusters (green) Figure 7 A -Histogram of the number (bandwith = 1) of matching peaks for peak-lists chosen from the same cluster (magenta) and from different clusters (green). B -Histogram of the number (bandwith = 3) of non-matching peaks, if peak-lists were chosen from the same (magenta) or from different clusters (green). The masses of randomly matching peaks differ, on average, more than the masses of non-random matching peaks. Therefore, weighting of mass measurement accuracy using a triangular function (see Equation 2) was implemented. This function reduced the weight of peaks with a small overlap.
Furthermore, in case of the MS/MS peak-lists, clusters of peaks separated by a mass smaller than the mass measurement accuracy (which is used for searches of matching peaks) were observed. Therefore, during matching the two peak-lists, some ambiguous matches (that is a peak is assigned to more than one peak in the second peak-list) occurred (cf. Methods - Figure 6, case A). In order to generate an unambiguous pairwise assignment of peaks we computed the non-crossing matching using standard dynamic programming techniques (cf. Methods -Finding the matching peaks).
We concluded that the probability of matches between independent peak-lists is higher in case of MS/MS than PMF data because of its lower mass measurement accu-racy, smaller mass range and larger number of peaks. Hence, the number of matches has a lower discriminating power in case of MS/MS than of PMF data.

The number of non-matching peaks
To discriminate peak-list pairs as being within or between clusters the number of non-matching peaks can be used. Figure 7B presents histograms of the number of nonmatching peaks between peak-list pairs (in magenta -the number of peaks that did not match if we compared two peak-lists within a cluster; in green -the number of peaks that did not match if we compare peak-lists pairs between two clusters). We observed that the probability of encountering a within peak-list pair increased if the number of non-matching peaks was small.
We have evaluated the performance of the following asymmetric binary measures: Gower coefficient [63] (cf. Appendix Equation 14) and Fowlkes-Mallows statistics (cf. Appendix Equation 15). These measures incorporate the number of matches and mismatches. If the length of the aligned peak-lists is defined (see Equation 3), symmetric binary measures e.g. Huberts Γ (Appendix Equation 16) and relative mutual information (Appendix Equation 19) can also be used. Furthermore, we examined Peak Intensities Figure 8 Peak Intensities. A) Histogram of intensities: X-axes -Intensity of log transformed root-means-square scaled peak intensities. Y-axis -Frequency. In grey: Histogram of the peak intensities that do not match a peak in any other peak-lists (peak-lists) within the same cluster (this mass is observed only once in the cluster). In magenta: Histogram of intensities of peaks that do match a peak within any peak-list within cluster (this mass is observed at least twice in the cluster). B) Altman Bland plot of intensities of the matching peaks for peak-lists pairs from within a cluster. C) Altman Bland plot of intensities of matching peaks for peak-lists pairs of between clusters. whether increasing or decreasing the weight of nonmatching peaks by a factor of two can increase the performance of the pairwise peak-list comparison (cf. Methods -Non matching peak pairs).

Peak intensities
Intensities associated with the masses observed at least twice within a cluster (magenta density, Figure 8A) tend to have higher peak intensities, compared to intensities of peaks whose masses are observed only once within a cluster (grey density, Figure 8A). Furthermore, intensities I X and I Y , of matching peaks in peak-lists from within a cluster, were more strongly correlated ( Figure 8C). The correlation was determined for log transformed and root mean square scaled peak intensities. This indicated that the intensity of peaks could be employed for better discrimination of within and between cluster peak-list pairs. We have studied the performance of several intensity based measures: the covariance (cf. Methods Equation 10), the dot-product (cf. Methods Equation 9), the Manhattan and Euclidean distances (cf. Methods  Equation 11), the relative distances Canberra and similarity index (cf. Methods Equation 12), and the sum of agreeing intensities (cf. Methods Equation 13).

Peak intensity transformation
If two peaks match within a cluster, the peak intensities are very likely (except random matches) to be estimates of a number of the ions of the same peptide (PMF) or peptide fragment (MS/MS). These estimates might contain errors resulting from random noise, different levels of peptide fragmentation due to variations in collision energy and different signal-to-noise ratios due to varying concentrations of sample present [22].
The observed error can depend on the observed intensity.
Thus, any statistical model would either directly account for the variances or transform the data so that the variances are approximately equal for all peak intensity levels.
To be able to determine the best variance stabilising transformation, one can examine the proportionate reduction in variation R 2 [42], obtained by analysis of the model |∆I| Ī + Ī 2 , where ∆I = I X -I Y are the residues and Ī = (I X + Y Y )/2 represent the average peak intensity of two matching peaks. This model accounts for a correlation of variance and intensity (|∆I| ~ Ī), unlike the naive model ∆I = E(∆I) [43]. If the variance is stable, the naive model suffices, and the proportionate reduction in variation obtained with the complex model |∆I| ~ Ī + Ī 2 should be close to zero.
The Altman-Bland plots [43] in Figures 8B and 8C, show the residues (∆I = I X -I Y ) as a function of the average peak intensity Ī = (I X + I Y )/2, where I X and I Y are the intensities of a matching peak pair (X, Y). The peak intensities are log-transformed and root mean square scaled (cf. Methods -Equation 22). Table 7 shows the adjusted R 2 of the model |∆I| ~ Ī + Ī 2 for various peak intensity transformations. The log-transformation gives the best variance stabilisation.
To elucidate to which extent the transformation influences the PAUC score, as compared to other factors, we kept the different transformations as factor levels of the pairwise peak-list comparison process. This wasdone despite the fact that the best transformation was determined by the analysis of the proportionate reduction of variance. In addition to the raw, root-squared and logtransformed intensities we included the ranking of the intensities [27] among the transformations studied by ANOVA.

Abbreviations
• ANOVA -analysis of variance • ROC -receiver operating characteristic curve.
• PAUC -partial area under the curve.

Authors' contributions
HL, JG, KR, ML and RH gave initial input to the research.
WEW implemented the dissimilarities, evaluation framework, and designed the figures.
WEW, RH and PM planned and carried out the analysis.  JG and PG provided the PMF datasets.
All authors contributed to the final version of the manuscript and approved it.