- Methodology article
- Open Access
Enhanced peptide quantification using spectral count clustering and cluster abundance
BMC Bioinformatics volume 12, Article number: 423 (2011)
Quantification of protein expression by means of mass spectrometry (MS) has been introduced in various proteomics studies. In particular, two label-free quantification methods, spectral counting and spectral feature analysis, have been extensively investigated in a wide variety of proteomic studies. The cornerstone of both methods is peptide identification based on a proteomic database search and subsequent estimation of peptide retention time. However, they often suffer from restrictive database search and inaccurate estimation of the liquid chromatography (LC) retention time. Furthermore, conventional peptide identification methods based on database or spectral library search algorithms such as SEQUEST or SpectraST do not guarantee that the best match, or even a high-scoring match, is the true match. Lastly, these methods are limited in the sense that target peptides cannot be identified unless they have been previously generated and stored in the database or spectral libraries.
To overcome these limitations, we propose a novel method, namely Quantification method based on Finding the Identical Spectral set for a Homogenous peptide (Q-FISH) to estimate the peptide's abundance from its tandem mass spectrometry (MS/MS) spectra through the direct comparison of experimental spectra. Intuitively, our Q-FISH method compares all possible pairs of experimental spectra in order to identify both known and novel proteins, significantly enhancing identification accuracy by grouping replicated spectra from the same peptide targets.
We applied Q-FISH to Nano-LC-MS/MS data obtained from human hepatocellular carcinoma (HCC) and normal liver tissue samples to identify differentially expressed peptides between the normal and disease samples. For a total of 44,318 spectra obtained through MS/MS analysis, Q-FISH yielded 14,748 clusters. Among these, 5,777 clusters were identified only in the HCC sample, 6,648 clusters only in the normal tissue sample, and 2,323 clusters in both the HCC and normal tissue samples. While it will be interesting to investigate peptide clusters found in only one sample, we further examined the spectral clusters identified in both the HCC and normal samples, since our goal is to identify and assess differentially expressed peptides quantitatively. The next step was to perform a beta-binomial test to isolate differentially expressed peptides between the HCC and normal tissue samples. This test resulted in 84 peptides with significantly different spectral counts between the HCC and normal tissue samples. We independently identified 50 and 95 peptides by SEQUEST, of which 24 and 56 peptides, respectively, were found to be known biomarkers for human liver cancer. Comparing the Q-FISH and SEQUEST results, we found that 22 of the 84 differentially expressed peptides found by Q-FISH were also identified by SEQUEST. Remarkably, of these 22 peptides discovered by both Q-FISH and SEQUEST, 13 peptides are known for human liver cancer and the remaining 9 peptides are known to be associated with other cancers.
We proposed a novel statistical method, Q-FISH, for accurately identifying protein species and simultaneously quantifying the expression levels of identified peptides from mass spectrometry data. Q-FISH analysis on human HCC and liver tissue samples identified many protein biomarkers that are highly relevant to HCC. Q-FISH can be a useful tool both for peptide identification and quantification on mass spectrometry data analysis. It may also prove to be more effective in discovering novel protein biomarkers than SEQUEST and other standard methods.
The main objective of functional proteomics analysis is often to estimate changes in the amount of proteins found in complex biological systems, in response to physiological and clinical factors such as cell development, disease progression, or drug treatment. In particular, one of the key issues in proteomics research based on tandem mass spectrometry (MS/MS) is the identification of protein species and the characterization of their expression changes in normal and disease samples. Three analysis techniques are often required in an MS/MS study: expressed peptide identification, target protein characterization, and quantification. For the hundreds to tens of thousands of fragment ion spectra generated, the assignment of the fragment ion spectra to peptide sequences, the identification of the proteins represented by each peptide, and the estimation of their abundances in the analyzed sample require complex computations and remain major statistical challenges.
Quantification of protein expression using mass spectrometry (MS) is often required for the discovery of protein biomarkers associated with cancer, their response to stimuli, cell signalling cascades and the function of cell cycle-promoting proteins, and various biomedical investigations. Two categories of quantification methods for MS data have been used: stable isotope labelling quantification and label-free quantification.
Several stable isotope-based quantification methods have been introduced based on different labelling reagents that can be chemically bound to peptides. It is, however, difficult to simultaneously quantify the amount of proteins/peptides in multiple samples because of the limited number of labelling reagents available. Moreover, current practical applications can typically quantify, at most, a few hundred peptides, measuring relative expression values for each pair of contrasting samples. Furthermore, the high cost of labelling reagents makes these quantification methods difficult to apply widely for the characterization of the global proteome.
On the other hand, label-free quantification, which does not require stable isotope labelling, has the advantages of low cost and simplicity. Currently, two label-free methods are available to measure expression levels of peptides: spectral counting and spectral feature analysis. The spectral counting method estimates the peptide expression levels by counting spectra (from MS/MS data) or through the estimation of the integrated ion intensities [6, 7]. The spectral feature analysis method quantitatively determines the peptide expression levels by comparing three-dimensional patterns (retention time, m/z and intensity) between different samples [8–13].
However, these label-free quantitative methods have two main shortcomings. The first limitation is the large number of false-positive discriminative peptides that result from chromatographic variability between LC-MS experiments. In spectral feature analysis, after finding two candidates with the same MS1 retention time and m/z, the difference in their MS1 intensities is used to define the peptide levels. Therefore, spectral feature analysis requires stringent reproducibility [3, 8] and additional pre-processing in the form of LC normalization or retention time alignment [14, 15].
The second limitation is that spectral counting cannot be performed without peptide identification because the relative peptide levels can be quantified only after peptide identification. In peptide identification, MS/MS spectra are verified using a database searching algorithm or spectral library searching algorithm such as SEQUEST, MASCOT, or SpectraST. Specifically, database search algorithms calculate score functions to compare the experimental MS/MS spectra with theoretical MS/MS spectra of peptides derived from protein sequence databases. The pool of theoretical MS/MS spectra is restricted by user-specified criteria such as mass tolerance, proteolytic enzymes, and the types of post-translational modification [2, 16]. A number of spectra may not be assigned to the correct peptides for diverse reasons, including deficiencies of the scoring scheme implemented in the database search tools, sequence variations (e.g., single nucleotide polymorphisms, SNPs), omissions in the database searched, post-translational or chemical modifications of the peptide analyzed, and the observation of genomic sequences that are not anticipated (e.g., splice forms, somatic rearrangement, and processed proteins). For all these reasons, a large number of important peptides may be lost during the database search.
Instead of matching acquired MS/MS spectra against theoretically predicted spectra, MS/MS spectra can also be assigned to peptides by matching them against those in a spectral library. The spectral library is compiled from a large collection of experimentally observed MS/MS spectra identified in previous experiments. Generally, a set of spectra of known peptide sequences is collected into a library and used as a reference. An experimental spectrum may then be identified by a similar match in the library. However, a spectrum can be identified in this way only if a spectrum of the same peptide was observed previously and entered into the library. These library searching methods are therefore well suited for targeted proteomics, in which one seeks not to discover previously unseen peptides, but rather to find and quantify expected peptides of interest in the sample.
To overcome these limitations of label-free quantification methods, we propose a novel spectral counting method to estimate a peptide's abundance by counting MS/MS spectra, comparing and clustering all experimentally observed spectra. This approach has several advantages. First, because the same peptide may be fragmented multiple times or repeatedly observed at different time points from an MS/MS run, multiple spectra may be extracted for the same peptides. In other words, duplicated spectra are ubiquitous in large-scale proteomics data. Our method thus attempts to identify and group all the duplicate spectra, which allows us to quantify the amount of peptide found in complex biological systems without searching through the databases or using LC normalization.
For the given spectra, our method, referred to as the Quantification method derived by Finding the Identical Spectra set for a Homogenous peptide (Q-FISH), employs a two-stage clustering algorithm to determine whether they are from the same peptides with homogeneous spectral patterns. The Q-FISH algorithm employs two similarity measures: the difference between two precursor ions and the correlation coefficient of moving window averages. The algorithm then clusters spectra from the same peptide through all plausible pair-wise comparisons. By counting the spectra in each cluster, we can estimate the amount of the corresponding peptide. Figure 1 summarizes the workflow of the proposed Q-FISH algorithm.
Our proposed algorithm was applied to identify differentially expressed peptides from real data obtained during a Nano-LC-MS/MS experiment performed on human HCC and normal liver tissue samples.
Results & Discussion
We introduced and tested the Q-FISH algorithm to identify and quantify all expressed peptides from an MS/MS dataset by clustering and counting spectra with homogeneous spectral patterns. In order to test our algorithm, we performed a Nano-LC MS/MS experiment with triplicate human hepatocellular carcinoma and normal liver tissue samples. For a total of 44,318 MS/MS spectra obtained through three MS/MS analyses of the two samples, Q-FISH yielded 14,748 clusters. More specifically, 5,777 clusters were identified only in the hepatocellular carcinoma (HCC) sample, 6,648 clusters only in the normal sample, and 2,323 clusters in both HCC and normal samples. For the purpose of comparison, we also ran SEQUEST and SpectraST to identify peptides. However, only 4,824 of the 44,318 spectra were identified using SEQUEST, corresponding to a total of 1,326 peptides. Generally, most database search algorithms, including SEQUEST, assign experimental spectra to peptides by comparing the experimental data with theoretical spectra generated from peptide sequences. It should be noted that neither the best match nor a match with a high search score is necessarily a true match, especially for novel protein targets. Therefore, many peptides could be misidentified, or not identified at all, unless they were previously generated and stored in the sequence database. In our experiments, a large number of experimental spectra (89.12%, namely 39,494 of a total of 44,318 spectra) could not be used for peptide identification using SEQUEST. With SpectraST, 5,549 spectra and 3,295 peptides could be identified. That is, a large number of spectra still could not be used for peptide identification by SpectraST (87.48%, namely 38,769 of a total of 44,318 spectra). In contrast, our proposed method directly compares all observed experimental spectra to discover differentially expressed peptides without a loss of observed spectra.
In Figure 2, the standardized intensities of the experimental spectra are plotted as positive values (upper part) and those of the reference spectrum as negative values (lower part). Specifically, Figure 2(a), which illustrates an example of one cluster with nine similar spectra, shows the spectral patterns of the MS/MS spectra as well as the reference spectrum for the clustered spectral set. The overall patterns look quite similar and all nine spectra seem to have almost identical patterns. Table 1 shows the search results returned by SEQUEST and SpectraST. In the case of spectral set S366006, the nine spectra were identified as the same peptide sequence, "SIFSAVLDELK", by both SEQUEST and SpectraST, with XCorr above 1.97. In addition, the reference spectrum for the clustered spectral set was identified as the peptide sequence "SIFSAVLDELK" with a SEQUEST score of XCorr = 2.96. This analysis indicates that these spectra can be regarded as the spectra of a homogenous peptide. In other words, each cluster can be expected to be composed of spectra from the same peptide.
Similarly, Figures 2(b) and 2(c) show spectral patterns for the reference spectrum and the experimental spectra of a single cluster. It should be noted that the overall patterns look quite similar and all spectra pairs are characterized by high correlation coefficients. However, while all spectra in S1157004 could be identified by SpectraST, two out of the eleven spectra could not be identified by SEQUEST, as shown in Table 1. Conversely, all spectra in S65002 were identified by SEQUEST with high scores, while three spectra could not be identified by SpectraST. In other words, if we relied only on conventional peptide identification such as SEQUEST or SpectraST, these spectra would have been excluded despite their similar peak patterns. In contrast, our Q-FISH algorithm was able to include these spectra without a loss of information.
In this study, we were interested in identifying proteins and characterizing their differential expression in normal and HCC samples. Hence, we first focused on the 2,323 clusters that were observed in both samples. Figure 3 and Table 2 show a scatter plot and a correlation matrix, respectively, of the numbers of spectra in the same cluster obtained through the replicated experiments on HCC and normal tissue samples. It is worth noting that the numbers of spectra in the same cluster showed high correlations within the same sample (0.7178~0.8315), while the numbers of spectra for different samples showed weak correlations (0.0654~0.1549). For a given spectral set, the reference spectrum was estimated by averaging the relative intensities of the spectra. Each reference spectrum is thus associated with the numbers of spectra observed in the normal and HCC samples. We computed the false clustering rate (FCR) on the 2,323 clusters shared by the HCC and normal samples. Among these clusters, 1,571 clusters had FCRs smaller than 0.05. Our next step was to perform a beta-binomial test to isolate differentially expressed peptides (DEPs). The result showed that only 84 out of the 1,571 reference spectra were characterized by different spectral counts between the HCC and normal tissue samples. Also, 5,777 clusters were observed only in the HCC sample and 6,648 clusters only in the normal sample by Q-FISH. Among these clusters, 1,571 and 1,556 clusters, respectively, had FCRs smaller than 0.05.
In order to compare the performance of Q-FISH with the spectral counting method based on SEQUEST, we used the human liver data and validated the results through a literature search. For the human liver data, Q-FISH provided 1,571 differentially expressed clusters for the HCC sample and 1,556 for the normal sample, among which 57 and 99 clusters were identified by SEQUEST in the HCC and normal samples, respectively. On the other hand, SEQUEST provided 93 and 145 peptides for the HCC and normal tissue samples, respectively. Among the 57 identified clusters in the HCC sample, 37 clusters were found to be over-expressed only by Q-FISH, while 20 peptides/clusters were found by both Q-FISH and SEQUEST; a further 73 peptides were identified only by SEQUEST. In the normal sample, 49 peptides/clusters were identified as over-expressed by both Q-FISH and SEQUEST, while 50 and 96 peptides/clusters were identified as over-expressed only by Q-FISH and only by SEQUEST, respectively.
We compared the two results through a literature search. We assumed that a peptide is a true match if it was reported in the previous cancer literature. While there is a certain degree of uncertainty for reported protein biomarkers, this assumption is not biased towards either of the two methods and allowed us to statistically compare their performance. For example, alpha-2-macroglobulin (A2M), annotated by "VSVQLEASPAFLAVPVEK", was reported to be over-expressed in the HCC sample. This peptide was found to be over-expressed by Q-FISH, but under-expressed by the spectral counting analysis based on SEQUEST. The full list of peptides is given in Additional file 1. Based on this report, the 2 × 2 confusion tables can be constructed as shown in Table 3.
For the Q-FISH result, 65 peptides were found in the literature: 31 for the HCC sample and 34 for the normal sample. Among the 31 peptides for the HCC sample, 25 are reported as over-expressed in the literature, and are assumed to be correctly identified. Among the 34 peptides for the normal sample, 17 are reported as under-expressed in the literature, and thus are assumed to be correctly identified. The remaining 6 and 17 peptides, respectively, are assumed to be incorrectly identified.
For the SEQUEST result, 93 peptides were reported in the literature: 43 for the HCC sample and 50 for the normal sample. Among them, 34 and 24 peptides, respectively, were correctly identified, while 9 and 26 peptides were incorrectly identified. Based on these numbers, the accuracy was computed as the proportion of correctly identified peptides, showing that Q-FISH (accuracy = 64.62%) has slightly higher accuracy than SEQUEST (accuracy = 62.37%). This comparison showed that Q-FISH performed as reliably as SEQUEST, despite the comparison giving SEQUEST a natural advantage.
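The reported accuracies can be checked with a few lines of Python. This is only a sketch: the function name and argument names are illustrative, and the per-sample split of incorrect calls is inferred from the totals given in the text (Q-FISH: 31 = 25 + 6 and 34 = 17 + 17; SEQUEST: 43 = 34 + 9 and 50 = 24 + 26).

```python
def accuracy(correct_hcc, wrong_hcc, correct_normal, wrong_normal):
    """Proportion of literature-validated peptides that were correctly identified."""
    correct = correct_hcc + correct_normal
    total = correct + wrong_hcc + wrong_normal
    return correct / total


# Counts inferred from the totals in the text (see Table 3 of the article).
qfish = accuracy(correct_hcc=25, wrong_hcc=6, correct_normal=17, wrong_normal=17)
sequest = accuracy(correct_hcc=34, wrong_hcc=9, correct_normal=24, wrong_normal=26)

print(f"Q-FISH:  {qfish:.2%}")    # 42 of 65 peptides
print(f"SEQUEST: {sequest:.2%}")  # 58 of 93 peptides
```

With these counts the two values round to 64.62% and 62.37%, matching the figures quoted above.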
Table 4 provides a list of potential protein biomarkers. Q scores were calculated by averaging the correlation coefficient between moving averages over the reference spectrum and experimental spectra of the clustered spectral set. If it has a relatively high value, then the reference spectrum is well represented in the clustered spectral set.
To find the potential biomarkers in each sample, we searched the reference spectra of clusters using SEQUEST. Consequently, we could find 50 and 95 peptides as candidate biomarkers from the HCC sample and the normal sample, respectively, as shown in Table 4. Among them, 24 peptides in the HCC sample and 56 peptides in the normal sample are known biomarkers for human liver cancer. Also, 22 reference spectra among the 84 DEPs were identified by SEQUEST. Among them, 13 peptides are also known markers for human liver cancer.
As shown in Table 4, carbamoyl-phosphate synthetase 1 (CPS1) is annotated by various sequences such as "MEYDGILIAGGPGNPALAEPLIQNVR", "SIFSAVLDELK", "TAVDSGIPLLTNFQVTK" and "GLNSESMTEETLK". These sequences are underexpressed in the HCC sample. Kinoshita et al. performed differential gene display analysis (DGDA) to compare the intensities of polymerase chain reaction (PCR) products and evaluated the degrees of mRNA expression in HCC tissue samples and noncancerous hepatitis tissues. They confirmed that CPS1 is underexpressed. Specifically, CPS1 synthesizes carbamyl phosphate from bicarbonate, adenosine triphosphate (ATP) and ammonia. A genetic mutation of CPS1 was identified as the source of hyperammonemia. In HCC tissue samples, underexpression of the CPS1 gene had been reported in rats, but their study was the first to report such a finding for humans. Heterogeneous nuclear ribonucleoprotein C (HNRNPC), annotated as "MIAGQVLDINLAAEPK", and actin, cytoplasmic 1 (ACTB), annotated as "DLYANTVLSGGTTMYPGIADR", were found to be over-expressed in the HCC sample [24, 25]. On the contrary, glutathione S-transferase (GSTA1), annotated as "NDGYLMFQQVPMVEIDGMK", has been down-regulated in the human HCC sample. Moreover, fatty acid-binding protein (FABP1), annotated as "SVTELNGDIITNTMTLGDIVFK", and Isoform 1 of Liver carboxylesterase 1 (CES1), annotated as "EGYLQIGANTQAAQK", are all characteristic of the HCC sample [27, 28].
As shown in Table 4, many peptides are also known to be associated with cancer. Specifically, EMILIN-1 (EMILIN1), elongation factor 1-delta (EEF1D), galectin-7/p53-induced gene 1 protein (LGALS7), hemoglobin subunit beta (HBB) and malate dehydrogenase 2 (MDH2) are differentially expressed in breast cancer cells [29–31]. In particular, the LGALS7 gene is known to be over-expressed when compared with control cells; likewise, it was over-expressed in our result. Table 4 provides a list of the different types of cancers associated with specific genes [28–34]. Figure 4 shows a scatter plot of the spectral counts of normal and HCC samples. The x axis and y axis represent the numbers of expressed spectra in the HCC and normal samples. Specifically, the symbol "▲" indicates DEPs identified with the use of SEQUEST, whereas the symbol "●" indicates unidentified DEPs. In total, 62 DEPs were not identified by SEQUEST despite their significant differences by the beta-binomial test.
We believe there were several reasons why 62 DEPs were not identified by SEQUEST. First, the "one-size-fits-all" search parameter values of SEQUEST may not have been appropriate for these protein targets. Second, these unidentified DEPs may carry other post-translational modifications or sequence variations (e.g., alternative splicing), or may provide insufficient peptide ion information.
We re-ran SEQUEST with many different parameter options, allowing phosphorylation modification and two missed cleavages, and using other sequence databases (NCBI nr and EST human). However, even with these parameter options, SEQUEST did not identify the remaining 62 DEPs. Next, we tried to identify the 62 reference spectra using other search engines such as MASCOT and SpectraST. MASCOT identified 2 DEPs, Alcohol dehydrogenase 1A (ADH1A) and Isoform 2 of Myosin-9 (MYH9), but SpectraST did not identify any DEPs. The remaining 60 DEPs could not be identified by these search engines. In order to identify these DEPs, further experiments may be needed. For example, additional MS/MS experiments such as MRM (Multiple Reaction Monitoring) or SRM (Selected Reaction Monitoring) can be carried out within the range of the corresponding retention times for all the unidentified spectra in order to collect more detailed peptide information.
In this paper, we proposed a novel method to estimate a peptide's abundance by counting MS/MS spectra clustered through the direct comparison of all experimentally observed spectra. For a given pair of spectra, our method can be used to answer the question of whether they are from the same peptide without computationally searching for them in a theoretical library of protein spectra. Examining all possible pair-wise comparisons, our method results in a set of spectra for the same peptide and enables us to estimate the amount of peptides found in biological samples of interest by counting the spectra in each cluster. Since our proposed method compares all possible pairs of experimental spectra, it can discover even modified and unknown peptides, which may not be searchable in a theoretical spectral library. For practical MS/MS experimental data, a large proportion of spectra are often misidentified or completely lost during a computational database search. In contrast, Q-FISH can use these spectra without any loss of information. As demonstrated in our practical examples, the majority of DEPs derived by Q-FISH were found to be highly related to various cancers, and these were not discovered by other methods.
We thus believe our Q-FISH algorithm will be highly useful in the identification of novel peptides. Also, Q-FISH has the potential to find applications in many other practical proteomic studies. For example, it can be used to discover unknown biomarkers or drug targets through the comparison of proteins with statistically significant differences and by quantifying sets of identical peptides in multiple samples. Unknown spectral clusters can often come from non-peptide contaminants, as revealed by a recent publication. Q-FISH can evaluate the significance of such unknown clusters, some of which can be novel biomarkers, requiring further experimental confirmation by de novo sequencing, unrestricted sequence database search (using e.g. InsPect) or spectral library search (using e.g. pMatch).
Sample Preparation, Nano-LC-ESI-MS/MS
Samples of hepatocellular carcinoma (HCC) tumour tissue and adjacent healthy liver tissue were collected under the guidelines of the Institutional Review Board (IRB) established at Yonsei Medical Center (Seoul, Korea). All tissues were prepared and, subsequently, in-solution tryptic digestion was performed as previously described. Nano-LC-MS/MS analysis was performed on an Agilent Nano HPLC 1100 system using a linear trap quadrupole (LTQ) mass spectrometer (Thermo Electron, San Jose, US). LC-MS/MS was performed as previously described. The peptide fractionation was performed by means of strong cation exchange chromatography (SCX) at a flow rate of 0.5 mL/min, where the absorbance of the column effluent was monitored at 280 nm for 40 min. Fractions were automatically transferred every 0.5 min into a 96-well microplate.
Nano-LC MS/MS experiments were carried out three times on two different samples (human liver cancer and normal tissues) and 44,318 MS/MS spectra were generated. These tandem mass spectrometry data were first analyzed by means of the database search software SEQUEST (Bioworks 3.2, ThermoFinnigan, San Jose, US). The sequence database, downloaded from the European Bioinformatics Institute (EBI), was the International Protein Index (IPI) human version 3.61. The next step was to combine the protein sequence database with its reverse sequences. The maximum number of missed cleavage sites was set to 1, and only tryptic cleavage after arginine and lysine was allowed. The mass tolerance of the precursor peptide ion was set to 3.0 Da, while the fragment ion tolerance was set to 0.5 Da. These tolerance values were chosen to minimize the false discovery rate (FDR) when XCorr > 1.5. Carboxyamidomethylation of cysteine and oxidation of methionine were allowed as modifications. All peptides assigned to reverse sequences were removed before proceeding to peptide identification to reduce false-positive identifications. We chose XCorr thresholds of 1.44 (+1), 1.97 (+2) and 3.13 (+3) for the respective charge states, which yielded an FDR close to 0.05, and required DeltaCn to be equal to or greater than 0.1. These score criteria were considered to ensure high confidence in the results of protein identification. The spectra derived by mass spectrometry were also analyzed by means of the spectral library search software SpectraST, which was initially developed by the Institute for Systems Biology (ISB) and the National Institute of Standards and Technology (NIST). SpectraST is integrated with the Trans-Proteomic Pipeline (TPP) software suite, which provides the supporting functionalities necessary in a full proteomics data analysis pipeline. Then, the SpectraST program was validated in the NIST Human IT Library with SpectraST scores > 0.9 [18, 38, 42]. The precursor tolerance was set to 1.5 Da/z (Thomson).
Q-FISH algorithm for direct comparison of experimental spectra
We assumed that MS/MS spectra from the same peptide would present similar patterns. Under this assumption, the proposed Q-FISH algorithm can be applied to find DEPs both in normal and disease samples. As shown in Figure 1, to evaluate the similarities between two spectra, we use a correlation coefficient of the moving window averages. The analytical process is summarized as follows:
1. Scale Standardization
Perform scale standardization by dividing the intensity values of each spectrum by their maximum value.
2. Moving average
Compute the moving window average over the spectra using a window of fixed size.
3. Correlation index for moving average-based peak patterns
Calculate a summary statistic based on the correlation coefficient of the moving averages between two spectra.
4. Spectral count-based quantification using two-stage clustering
Cluster duplicated peptides with similar peak patterns and retention time using a two-stage clustering method.
5. Identification of differentially expressed peptides
Employ the beta-binomial test to identify DEPs among the experimental groups.
Similarity measure between pairs of MS/MS spectra
Because the intensities of the spectra obtained may differ for various physical and chemical reasons, such as inconsistencies in the total ion currents, we cannot use the raw intensities of the m/z peaks. In light of this, we used a scale-standardization method, which divides the intensities of all m/z peaks in a spectrum by their maximum value. Let x[i] be the intensity of the ith m/z peak. Then, the scale standardized intensity, y[i], is defined by
\[ y[i] = \frac{x[i]}{\max_{k} x[k]} \]
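As a minimal illustration of the scale standardization described above, the following Python sketch divides each peak intensity by the maximum intensity of its spectrum. The function name and the example intensities are hypothetical and are not taken from the article's data.

```python
def scale_standardize(intensities):
    """Divide every peak intensity by the largest intensity in the spectrum."""
    peak_max = max(intensities)
    if peak_max <= 0:
        raise ValueError("spectrum needs at least one positive intensity")
    return [x / peak_max for x in intensities]


raw = [120.0, 480.0, 60.0, 960.0]  # hypothetical raw peak intensities
scaled = scale_standardize(raw)
print(scaled)  # [0.125, 0.5, 0.0625, 1.0]
```

After this step every spectrum has a maximum intensity of 1, so spectra acquired with different total ion currents become comparable.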
Moving window average
To reduce the background noise of the peak intensities, the moving window average (MWA) is used. The simplest moving average is the unweighted (or uniformly weighted) average of n data points within a given window, and the weighted moving window average (WMWA) is the average calculated using multiplicative weight factors that give a different weight to each data point. Among the various options for the weights of the WMWA, we selected the "Gaussian" kernel, which uses the probability density function (pdf) of the standard Gaussian distribution with mean 0 and variance 1 as a weight function.
For a given spectrum, the MWA is calculated by averaging the peak intensities within the sliding window sequentially for all m/z peaks. In other words, the MWA is not a single value, but a set of averages. The next step is to calculate the correlation between the MWAs of two spectra and determine whether they are identical spectra from the same peptide.
We assume that there are N moving windows of fixed size K along the entire m/z range. Subsequently, the WMWA for the ith moving window (i = 1, 2,..., N) is defined by
\[ \mathrm{WMWA}_i = \sum_{j=1}^{K} w_j \, y[i+j] \]
where y[i + j] is the jth scale standardized intensity in the ith moving window and w_j are the weights. For a uniform kernel, w_j = 1/K; for the Gaussian kernel, w_j = Φ(z_j), where Φ represents the pdf of the standard Gaussian distribution and z_j represents the value of y[i+j] standardized by using the mean and variance of the m/z's in the ith window. The total number of windows, N, is determined by the fixed window size K and the entire m/z range (20-2000 Da). In order to determine the optimal window size, we randomly selected some pairs of spectra from the same and different peptides using a target-decoy sequence database. We performed receiver operating characteristic (ROC) analysis to determine the window size. Based on the ROC analysis, we chose a window size of K = 30 (3.0 Da) and accordingly N = 19,771 (20-2000 Da at intervals of 0.1 Da). However, the areas under the curve (AUC) did not differ much across window sizes; that is, the results were not very sensitive to the window size.
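The moving window average can be sketched in Python as follows. This is a simplified illustration and not the authors' implementation. Several points are assumptions: the spectrum is taken to be already binned on a regular 0.1 Da grid, z_j is interpreted as the standardized position of the jth bin inside the window, the Gaussian weights are rescaled to sum to 1, and the window count assumes 19,800 bins (20-2000 Da at 0.1 Da), which is the bin count implied by the reported N = 19,771 for K = 30.

```python
import math


def window_weights(k, kernel="uniform"):
    """Return k weights (summing to 1) for one sliding window."""
    if k < 1:
        raise ValueError("window size must be at least 1")
    if kernel == "uniform":
        return [1.0 / k] * k
    if kernel == "gaussian":
        # Assumption: z_j is the position j standardized by the mean and
        # standard deviation of the positions 0..k-1 inside the window.
        mean = (k - 1) / 2.0
        sd = math.sqrt(sum((j - mean) ** 2 for j in range(k)) / k)
        if sd == 0.0:  # a window of one bin has a single weight
            return [1.0]
        pdf = [math.exp(-0.5 * ((j - mean) / sd) ** 2) / math.sqrt(2.0 * math.pi)
               for j in range(k)]
        total = sum(pdf)
        return [p / total for p in pdf]  # rescaling to sum 1 is an assumption
    raise ValueError(f"unknown kernel: {kernel}")


def moving_window_average(y, k, kernel="uniform"):
    """Weighted average of y over every full window of k consecutive bins."""
    if not 1 <= k <= len(y):
        raise ValueError("window size must be between 1 and len(y)")
    w = window_weights(k, kernel)
    return [sum(w[j] * y[i + j] for j in range(k))
            for i in range(len(y) - k + 1)]


def number_of_windows(n_bins, k):
    """Number of full windows of size k that fit into n_bins bins."""
    return n_bins - k + 1


y = [0.0, 0.3, 1.0, 0.3, 0.0, 0.0]  # hypothetical scale standardized intensities
uniform = moving_window_average(y, k=3)
gauss = moving_window_average(y, k=3, kernel="gaussian")
n_windows = number_of_windows(n_bins=19800, k=30)
print(n_windows)  # 19771
```

The Gaussian kernel gives the centre of each window more weight than its edges, so an isolated peak is smoothed less strongly than with the uniform kernel.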
Correlation index for moving average-based peak patterns
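For peptides p and q, the correlation coefficient is computed as follows:

$$\rho_{pq} = \frac{\sum_{i=1}^{N}\left(M_p[i]-\bar{M}_p\right)\left(M_q[i]-\bar{M}_q\right)}{\sqrt{\sum_{i=1}^{N}\left(M_p[i]-\bar{M}_p\right)^2}\,\sqrt{\sum_{i=1}^{N}\left(M_q[i]-\bar{M}_q\right)^2}},$$

where $M_p[i]$ and $M_q[i]$ are the moving window averages of the ith window for peptides p and q, and $\bar{M}_p$ and $\bar{M}_q$ are the means of these moving window averages over the N windows. The closer the correlation coefficient is to 1, the stronger the evidence that the two spectra come from the same peptide.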
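The correlation index can be sketched in a few lines, assuming it is the ordinary Pearson correlation between two equally long lists of moving window averages. The function name is hypothetical, and raising an error for a flat profile is a choice made for this sketch.

```python
import math


def mwa_correlation(m_p, m_q):
    """Pearson correlation between two moving-window-average profiles."""
    if len(m_p) != len(m_q) or len(m_p) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_p = sum(m_p) / len(m_p)
    mean_q = sum(m_q) / len(m_q)
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(m_p, m_q))
    ss_p = sum((a - mean_p) ** 2 for a in m_p)
    ss_q = sum((b - mean_q) ** 2 for b in m_q)
    if ss_p == 0.0 or ss_q == 0.0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(ss_p * ss_q)
```

Two profiles that differ only by a positive scale factor give a value of 1, so overall intensity differences between replicate spectra do not lower the index.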
Quantification by counting spectra in clustered spectra set from a homogenous peptide
Two-stage cluster analysis is used to group spectra with similar patterns into peptide sets. As assumed above, spectra with approximately the same shape are taken to come from the same peptide, so each cluster is expected to consist of spectra obtained from a homogenous peptide. The two-stage analysis employs two similarity measures: the first is the difference between precursor ions and the second is the correlation coefficient between two MWAs. MS/MS spectra obtained from the same peptide are theoretically expected to have similar precursor ions. First, therefore, clusters are defined in terms of pairwise differences between precursor ions: for any pair of precursor ions in the same cluster, the difference is smaller than a threshold value, which we set to ± 1 Da in our analysis. Next, a hierarchical clustering analysis is performed within each of these clusters. Specifically, we employ "single linkage," also known as the nearest neighbour technique, with the correlation coefficient of the MWAs as the similarity measure.
Because this two-stage clustering analysis yields clustered spectra sets consisting of MS/MS spectra from the same peptide, the abundance of each peptide can be quantified by counting the spectra included in its clustered set. Lastly, representative spectra, called "reference spectra," can be defined based on the basic patterns of precursor ions as the average spectrum of a given spectral set.
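The two stages can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: stage one is a greedy pass over sorted precursor masses (one simple way to guarantee that all pairs in a group differ by less than the tolerance), and stage two cuts a single-linkage clustering at a correlation cutoff, which is equivalent to taking connected components of the graph linking spectra whose correlation reaches the cutoff. All function names are hypothetical; the default cutoff of 0.6 and tolerance of 1 Da are the values quoted in the text.

```python
import math


def correlation(a, b):
    """Pearson correlation of two equally long profiles (0.0 if one is flat)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b) if ss_a > 0 and ss_b > 0 else 0.0


def group_by_precursor(masses, tolerance=1.0):
    """Stage 1: split spectra so that all precursor masses in a group
    differ by less than `tolerance` (greedy pass over sorted masses)."""
    order = sorted(range(len(masses)), key=lambda i: masses[i])
    groups = []
    for i in order:
        if groups and masses[i] - masses[groups[-1][0]] < tolerance:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def single_linkage(indices, profiles, cutoff):
    """Stage 2: merge spectra linked by a chain of correlations >= cutoff."""
    parent = {i: i for i in indices}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for pos, a in enumerate(indices):
        for b in indices[pos + 1:]:
            if correlation(profiles[a], profiles[b]) >= cutoff:
                parent[find(a)] = find(b)
    clusters = {}
    for i in indices:
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def two_stage_clusters(masses, profiles, cutoff=0.6, tolerance=1.0):
    """Run stage 1 on precursor masses, then stage 2 inside each group."""
    clusters = []
    for group in group_by_precursor(masses, tolerance):
        clusters.extend(single_linkage(group, profiles, cutoff))
    return clusters
```

Restricting the pairwise correlations to spectra that already share a precursor mass group keeps the number of comparisons far below the all-pairs count over the whole data set.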
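Counting spectra per cluster and averaging them into a reference spectrum can be illustrated with a short sketch. It assumes that each spectrum carries a cluster label and a sample label and that all spectra are binned on a common m/z grid; the function names and the sample labels are hypothetical.

```python
from collections import Counter


def spectral_counts(cluster_ids, sample_labels):
    """Count, for every cluster, how many of its spectra come from each sample."""
    counts = {}
    for cluster_id, sample in zip(cluster_ids, sample_labels):
        counts.setdefault(cluster_id, Counter())[sample] += 1
    return counts


def reference_spectrum(spectra):
    """Bin-wise average of the spectra in one cluster (common m/z grid assumed)."""
    if not spectra:
        raise ValueError("cluster contains no spectra")
    n = len(spectra)
    return [sum(column) / n for column in zip(*spectra)]


counts = spectral_counts(
    [0, 0, 1, 0, 1],
    ["HCC", "normal", "HCC", "HCC", "HCC"],
)
print(counts[0]["HCC"], counts[0]["normal"])  # prints 2 1
```

The per-sample counts of a cluster are the quantities later compared between samples.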
Validation of the clustering results using retention times
It is well known that the same peptides tend to elute continuously within a limited liquid chromatography (LC) interval. Thus, the clustering results can be validated using the retention time (RT) information.
In order to validate the clustering results, we propose a new measure that estimates the clustering error rate from the spectral RT information. Note that Q-FISH provides a list of clusters. If a cluster contains only spectra from the same peptide, their RTs should have similar values; if it contains spectra from different peptides, the RTs should differ. To quantify this, we consider measures of the variability of the RTs within a cluster, such as the coefficient of variation (CV) and the standard deviation (SD). Since the RT varies considerably across spectra, the CV is a better measure than the SD. Using the CV, we propose a new measure called the false clustering rate (FCR), which is similar in spirit to the false discovery rate (FDR). It measures how often a cluster is composed of spectra from different peptides. We use a threshold value of the CV, Δ, to determine whether a cluster is well clustered: if the CV of a given cluster is smaller than Δ, we call it a good cluster. For a given value of Δ, the FCR can be computed by the following procedure:
Calculate the coefficient of variation (CV) of spectral RT in the same clusters from the Q-FISH results.
Permute the spectra while maintaining the number of spectra in each cluster fixed.
Calculate $CV_p$ for each permuted cluster in the pth permuted sample.
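Compute FCR as follows:

$$\mathrm{FCR}(\Delta) = \frac{1}{P\,C}\sum_{p=1}^{P}\sum_{c=1}^{C} I\left(CV_{p,c} < \Delta\right),$$

where $CV_{p,c}$ is the CV of the cth cluster in the pth permuted sample, $I(\cdot)$ is the indicator function, P is the number of permutations, Δ the threshold value, and C the total number of clusters.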
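The permutation procedure can be sketched as below. Several details are assumptions made for illustration: the FCR is taken to be the fraction of permuted clusters whose CV falls below Δ, averaged over permutations; the CV is expressed in percent; and clusters with a single spectrum are left out of the count because their CV is undefined. The function names and the default number of permutations are hypothetical.

```python
import random
import statistics


def cv_percent(values):
    """Coefficient of variation (sample SD over mean), in percent."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


def false_clustering_rate(clusters, delta, n_perm=200, seed=0):
    """Estimate the FCR from retention times grouped by cluster.

    `clusters` is a list of lists of retention times, one list per cluster.
    """
    rng = random.Random(seed)
    pooled = [rt for cluster in clusters for rt in cluster]
    sizes = [len(cluster) for cluster in clusters]
    # Assumption: only clusters with at least two spectra are scored.
    n_scored = sum(1 for size in sizes if size >= 2)
    if n_scored == 0:
        raise ValueError("no cluster has two or more spectra")
    below = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # permute spectra, keeping cluster sizes fixed
        start = 0
        for size in sizes:
            block = pooled[start:start + size]
            start += size
            if size >= 2 and cv_percent(block) < delta:
                below += 1
    return below / (n_perm * n_scored)


rts = [[10.0, 10.1, 10.2], [50.0, 50.5, 51.0], [90.0, 90.3, 90.9]]
fcr = false_clustering_rate(rts, delta=4.4)
```

In this toy input the observed clusters all have a CV of about 1% or less, whereas a randomly assembled cluster rarely falls below the threshold, so the estimated rate is small.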
For our HCC data, we computed the FCR for various values of Δ, as summarized in Table 5. From our analysis, we chose Δ = 4.4, which yielded an FCR close to 0.05.
We also used the FCR to determine the cut-off value of the correlation coefficient, ρ, for spectral clustering. For a given threshold value of ρ, the FCR can be computed in a similar manner as for Δ. We computed the FCR for various values of ρ, as summarized in Table 5, and chose ρ = 0.6, which yielded an FCR close to 0.05.
Differentially expressed peptides (DEPs)
To estimate a peptide's abundance in different samples, such as control and disease tissue samples, a spectral counting method like Q-FISH can be employed. Pham et al. proposed the use of the beta-binomial distribution to test the significance of DEPs in spectral counts in label-free mass spectrometry-based proteomics. Their results revealed that the beta-binomial test can be applied to experiments with one or more replicates, as well as to the comparison of multiple conditions. We applied the beta-binomial model to test for DEPs in the clustered spectral sets across three replicated MS/MS experiments.
Let x denote the number of spectral counts in a clustered spectral set and n the total number of spectral counts of all spectra in each sample. Then, x is assumed to follow a binomial distribution with true proportion π, 0 ≤ π ≤ 1,

$$P(x \mid \pi) = \binom{n}{x}\,\pi^{x}(1-\pi)^{n-x}.$$

Rather than being fixed, π is treated as a random variable following the beta distribution with real parameters α > 0 and β > 0,

$$f(\pi) = \frac{\pi^{\alpha-1}(1-\pi)^{\beta-1}}{B(\alpha,\beta)}.$$

Subsequently, the marginal distribution of x is the beta-binomial distribution,

$$P(x) = \binom{n}{x}\,\frac{B(x+\alpha,\; n-x+\beta)}{B(\alpha,\beta)},$$

where B(·,·) is the beta function.
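The following parameterization is used:

$$E(\pi) = \frac{\alpha}{\alpha+\beta} = h(\eta), \qquad \phi = \frac{1}{\alpha+\beta+1},$$

where h is the inverse of the link function (logit or complementary log-log), X a design matrix, b a vector of fixed effects, η = Xb the linear predictor, and ϕ the overdispersion parameter. Based on this parameterization, the marginal mean and variance are:

$$E(x) = n\,h(\eta), \qquad \mathrm{Var}(x) = n\,h(\eta)\,[1-h(\eta)]\,[1+(n-1)\phi].$$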
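As a numerical check of the beta-binomial model, the sketch below evaluates the marginal probability mass function and compares its mean and variance with closed-form expressions. It assumes the commonly used parameterization in which the mean proportion is α/(α+β) and the overdispersion is ϕ = 1/(α+β+1); the function names and the example values of n, the mean proportion, and ϕ are illustrative only.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(x, n, alpha, beta):
    """P(x) = C(n, x) * B(x + alpha, n - x + beta) / B(alpha, beta)."""
    log_choose = (
        math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    )
    return math.exp(
        log_choose + log_beta(x + alpha, n - x + beta) - log_beta(alpha, beta)
    )


def shape_parameters(mean_pi, phi):
    """Convert (mean proportion, overdispersion) to (alpha, beta).

    Assumes mean_pi = alpha / (alpha + beta) and phi = 1 / (alpha + beta + 1).
    """
    total = 1.0 / phi - 1.0
    return mean_pi * total, (1.0 - mean_pi) * total


def moments(n, alpha, beta):
    """Mean and variance obtained by summing over the pmf."""
    pmf = [beta_binomial_pmf(x, n, alpha, beta) for x in range(n + 1)]
    mean = sum(x * p for x, p in enumerate(pmf))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))
    return mean, var


n, mean_pi, phi = 20, 0.3, 0.1
alpha, beta = shape_parameters(mean_pi, phi)
mean, var = moments(n, alpha, beta)
expected_var = n * mean_pi * (1 - mean_pi) * (1 + (n - 1) * phi)
```

With ϕ approaching 0 the variance approaches the binomial value, so ϕ measures how much extra variability between replicates the model allows.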
It should be noted that the parameters b and ϕ are estimated by maximizing the log-likelihood of the marginal model. Given the estimated coefficients, the hypothesis test is rephrased as whether the b coefficient is 0. We also used Benjamini and Hochberg's method to correct for multiple testing of DEPs.
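The Benjamini and Hochberg correction can be illustrated with a short sketch that converts raw p-values into adjusted p-values using the standard step-up procedure. The function name and the example p-values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A cluster would then be called differentially expressed when its adjusted p-value falls below the chosen false discovery rate level.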
Hernandez P, Markus M, Appel RD: Automated protein identification by tandem mass spectrometry: issue and strategies. Mass Spectrometry Reviews 2006, 25: 235–254. 10.1002/mas.20068
Nesvizhskii AI, Vitek O, Aebersold R: Analysis and validation of proteomic data generated by tandem mass spectrometry. Nature Methods 2007, 4: 787–797. 10.1038/nmeth1088
Washburn MP, Ulaszek RB, Yates JR: Reproducibility of quantitative proteomic analyses of complex biological mixtures by multidimensional protein identification technology. Anal Chem 2003, 75: 5054–5061. 10.1021/ac034120b
Ong SE, Mann M: Mass spectrometry-based proteomics turns quantitative. Nat Chem Biol 2005, 1: 252–262. 10.1038/nchembio736
Wang M, You J, Bemis KG, Tegeler TJ, Brown DP: Label-free mass spectrometry-based protein quantification technologies in proteomic analysis. Briefings in Functional Genomics and Proteomics 2008, 7(5):329–339. 10.1093/bfgp/eln031
Old WM, Meyer-Arendt K, Aveline-Wolf L, Pierce KG, Mendoza A, Sevinsky JR, Resing KA, Ahn NG: Comparison of label-free methods for quantifying human proteins by shotgun proteomics. Mol Cell Proteomics 2005, 4(10):1487–1502. 10.1074/mcp.M500084-MCP200
Little KM, Lee JK, Ley K: ReSASC: a resampling-based algorithm to determine differential protein expression from spectral count data. Proteomics 2010, 10: 1212–1222. 10.1002/pmic.200900328
Jaffe JD, Mani DR, Leptos KC, Church GM, Gillette MA, Carr SA: PEPPeR, a platform for experimental proteomic pattern recognition. Mol Cell Proteomics 2006, 5: 1927–1941. 10.1074/mcp.M600222-MCP200
Li XJ, Yi EC, Kemp CJ, Zhang H, Aebersold R: A software suite for the generation and comparison of peptide arrays from sets of data collected by liquid chromatography-mass spectrometry. Mol Cell Proteomics 2005, 4: 1328–1340. 10.1074/mcp.M500141-MCP200
Breukelen B, Toorn HW, Drugan MM, Heck AJ: StatQuant: a post-quantification analysis toolbox for improving quantitative mass spectrometry. Bioinformatics 2009, 25: 1472–1473. 10.1093/bioinformatics/btp181
Mann B, Madera M, Sheng Q, Tang H, Mechref Y, Novotny MV: ProteinQuant Suite: a bundle of automated software tools for label-free quantitative proteomics. Rapid Commun Mass Spectrom 2008, 22: 3823–3834. 10.1002/rcm.3781
Zhang H, Yi EC, Li XJ, Mallick P, Kelly-Spratt KS, Masselon CD, Camp DG, Smith RD, Kemp CJ, Aebersold R: High throughput quantitative analysis of serum proteins using glycopeptide capture and liquid chromatography mass spectrometry. Mol Cell Proteomics 2005, 4: 144–155.
Radulovic D, Jelveh S, Ryu S, Hamilton TG, Foss E, Mao Y, Emili A: Informatics platform for global proteomic profiling and biomarker discovery using liquid chromatography-tandem mass spectrometry. Mol Cell Proteomics 2004, 3: 984–997. 10.1074/mcp.M400061-MCP200
Prakash A, Mallick P, Whiteaker J, Zhang H, Paulovich A, Flory M, Lee H, Aebersold R, Schwikowski B: Signal maps for mass spectrometry-based comparative proteomics. Mol Cell Proteomics 2006, 5: 423–432.
Fischer B, Grossmann J, Roth V, Gruissem W, Baginsky S, Buhmann JM: Semi-supervised LC/MS alignment for differential proteomics. Bioinformatics 2006, 22: e132-e140. 10.1093/bioinformatics/btl219
Kapp E, Schutz F: Overview of tandem mass spectrometry (MS/MS) database search algorithms. Current Protocols in Protein Science 2007, 25: 25.2.1–25.2.19.
Nesvizhskii A, Roos FF, Grossmann J, Vogelzang M, Eddes JS, Gruissem W, Baginsky S, Aebersold R: Dynamic spectrum quality assessment and iterative computational analysis of shotgun proteomic data: toward more efficient identification of post-translational modifications, sequence polymorphisms, and novel peptides. Mol Cell Proteomics 2006, 5: 652–670.
Nesvizhskii A: A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J of Proteomics 2010, 73: 2092–2123. 10.1016/j.jprot.2010.08.009
Lam H, Deutsch E, Eddes J, Eng JK, King N, Stein SE, Aebersold R: Development and validation of a spectral library searching method for peptide identification from MS/MS. Proteomics 2007, 7: 655–667. 10.1002/pmic.200600625
Beer I, Barnea E, Ziv T, Admon A: Improving large-scale proteomics by clustering of mass spectrometry data. Proteomics 2004, 4(4):950–960. 10.1002/pmic.200300652
Pham TV, Piersma SR, Warmoes M, Jimenez CR: On the beta-binomial model for analysis of spectral count data in label-free tandem mass spectrometry-based proteomics. Bioinformatics 2010, 26(3):363–369. 10.1093/bioinformatics/btp677
Seriramalu R, Pang WW, Jayapalan JJ, Mohamed E, Abdul-Rahman PS, Bustam AZ, Khoo AS, Hashim OH: Application of champedak mannose-binding lectin in the glycoproteomic profiling of serum samples unmasks reduced expression of alpha-2 macroglobulin and complement factor B in patients with nasopharyngeal carcinoma. Electrophoresis 2010, 31: 2388–2395. 10.1002/elps.201000164
Kinoshita M, Miyata M: Underexpression of mRNA in human hepatocellular carcinoma focusing on eight loci. Hepatology 2002, 36(2):433–438. 10.1053/jhep.2002.34851
Fu LY, Jia HL, Dong QZ, Wu JC, Zhao Y, Zhou HJ, Ren N, Ye QH, Qin LX: Suitable reference genes for real-time PCR in human HBV-related hepatocellular carcinoma with different clinical prognoses. BMC Cancer 2009, 9: 49. 10.1186/1471-2407-9-49
Hwang TL, Liang Y, Chien KY, Yu JS: Overexpression and elevated serum levels of phosphoglycerate kinase 1 in pancreatic ductal adenocarcinoma. Proteomics 2006, 6(7):2259–2272. 10.1002/pmic.200500345
Li Y, Wan D, Wei W, Su J, Cao J, Qiu X, Ou C, Ban K, Yang C, Yue H: Candidate genes responsible for human hepatocellular carcinoma identified from differentially expressed genes in hepatocarcinogenesis of the tree shrew (Tupaia belangeri chinensis). Hepatol Res 2008, 38(1):85–89. 10.1111/j.1872-034X.2007.00207.x
Elchuri S, Naeemuddin M, Sharpe O, Robinson WH, Huang TT: Identification of biomarkers associated with the development of hepatocellular carcinoma in CuZn superoxide dismutase deficient mice. Proteomics 2007, 7(12):2121–2129. 10.1002/pmic.200601011
Na K, Lee EY, Lee HJ, Kim KY, Lee H, Jeong SK, Jeong AS, Cho SY, Kim SA, Song SY, Kim KS, Cho SW, Kim H, Paik YK: Human plasma carboxylesterase 1, a novel serologic biomarker candidate for hepatocellular carcinoma. Proteomics 2009, 9: 3989–3999. 10.1002/pmic.200900105
Demers M, Rose AA, Grosset AA, Biron-Pain K, Gaboury L, Siegel PM, St-Pierre Y: Overexpression of galectin-7, a myoepithelial cell marker, enhances spontaneous metastasis of breast cancer cells. Am J Pathol 2010, 176(6):3023–3031. 10.2353/ajpath.2010.090876
Schulz DM, Böllner C, Thomas G, Atkinson M, Esposito I, Höfler H, Aubele M: Identification of differentially expressed proteins in triple-negative breast carcinomas using DIGE and mass spectrometry. J Proteome Res 2009, 8(7):3430–3438. 10.1021/pr900071h
Pau Ni IB, Zakaria Z, Muhammad R, Abdullah N, Ibrahim N, Aina Emran N, Hisham Abdullah N, Syed Hussain SN: Gene expression patterns distinguish breast carcinomas from normal breast tissues: the Malaysian context. Pathol Res Pract 2010, 206(4):223–228. 10.1016/j.prp.2009.11.006
Yue W, Sun LY, Li CH, Zhang LX, Pei XT: Screening and identification of ovarian carcinomas related genes. Ai Zheng 2004, 23(2):141–145.
Hirata T, Yamamoto H, Taniguchi H, Horiuchi S, Oki M, Adachi Y, Imai K, Shinomura Y: Characterization of the immune escape phenotype of human gastric cancers with and without high-frequency microsatellite instability. J Pathol 2007, 211(5):516–523. 10.1002/path.2142
Yusenko MV, Ruppert T, Kovacs G: Analysis of differentially expressed mitochondrial proteins in chromophobe renal cell carcinomas and renal oncocytomas by 2-D gel electrophoresis. Int J Biol Sci 2010, 6(3):213–224.
Fu Y, Xiu LY, Jia W, Ye D, Sun RX, Qian XH, He SM: DeltAMT: A Statistical Algorithm for Fast Detection of Protein Modifications From LC-MS/MS Data. Mol Cell Proteomics 2011, 10(5):M110.000455. 10.1074/mcp.M110.000455
Tanner S, Shu H, Frank A, Wang LC, Zandi E, Mumby M, Pevzner PA, Bafna V: InsPecT: identification of posttranslationally modified peptides from tandem mass spectra. Anal Chem 2005, 77(14):4626–39. 10.1021/ac050102d
Ye D, Fu Y, Sun RX, Wang HP, Yuan ZF, Chi H, He SM: Open MS/MS spectral library search to identify unanticipated post-translational modifications and increase spectral identification rate. Bioinformatics 2010, 26(12):i399–406. 10.1093/bioinformatics/btq185
Lee HJ, Kang MJ, Lee EY, Cho SY, Kim H, Paik YK: Application of a peptide-based PF2D platform for quantitative proteomics in disease biomarker discovery. Proteomics 2008, 8(16):3371–3381. 10.1002/pmic.200800111
Kapp EA, Schütz F, Connolly LM, Chakel JA, Meza JE, Miller CA, Fenyo D, Eng JK, Adkins JN, Omenn GS, Simpson RJ: An evaluation, comparison, and accurate benchmarking of several publicly available MS/MS search algorithms: sensitivity and specificity analysis. Proteomics 2005, 5(13):3475–3490. 10.1002/pmic.200500126
Lee HJ, Na K, Kwon MS, Park T, Kim KS, Kim H, Paik YK: A new versatile peptide-based size exclusion chromatography platform for global profiling and quantitation of candidate biomarkers in hepatocellular carcinoma specimens. Proteomics 2011, 11: 1976–1984. 10.1002/pmic.201100002
Malcolm SB, Peter K: Mass spectral compatibility of four proteomics stains. Journal of Proteome Research 2007, 6: 4313–4320. 10.1021/pr070398z
Lam H, Aebersold R: Using spectral libraries for peptide identification from tandem mass spectrometry (MS/MS) data. Curr Protoc Protein Sci 2010, 60: 25.5.1–25.5.9.
Skellam JG: A probability distribution derived from the binomial distribution by regarding the probability of success as variable between the sets of trials. J R Stat Soc Ser B (Methodol) 1948, 10: 257–261.
Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Statist Soc Ser B 1995, 57: 289–300.
The work of TP was supported by the National Research Foundation (KRF-2008-313-C00086) and the Brain Korea 21 Project of the Ministry of Education. The work of JKL was supported in part by the US National Institutes of Health (R01HL081690).
The authors declare that they have no competing interests.
SML and MSK performed the statistical analysis and drafted the manuscript. HJL, YKP, and HT carried out mass spectrometry experiments. JKL and TP conceived of the study and participated in its coordination. All authors wrote, read, and approved the final manuscript.
Seungmook Lee and Min-Seok Kwon contributed equally to this work.
Electronic supplementary material
Additional file 1: In order to compare the performance of Q-FISH with the spectral counting method by SEQUEST, we used the human liver data and validated the results through literature search. For the human liver data, Q-FISH provided 1571 differentially expressed clusters for the HCC sample and 1556 for the normal sample, among which 57 and 99 clusters were identified by SEQUEST in the HCC and normal samples, respectively. On the other hand, SEQUEST provided 93 and 145 peptides for the HCC and normal tissue samples, respectively. (XLS 84 KB)
Lee, S., Kwon, M., Lee, H. et al. Enhanced peptide quantification using spectral count clustering and cluster abundance. BMC Bioinformatics 12, 423 (2011). https://doi.org/10.1186/1471-2105-12-423
- Reference Spectrum
- Peptide Identification
- Spectral Count
- Normal Tissue Sample
- Human Liver Cancer