- Software
- Open access
AYUMS: an algorithm for completely automatic quantitation based on LC-MS/MS proteome data and its application to the analysis of signal transduction
BMC Bioinformatics volume 8, Article number: 15 (2007)
Abstract
Background
Comprehensive description of the behavior of cellular components in a quantitative manner is essential for systematic understanding of biological events. Recent LC-MS/MS (tandem mass spectrometry coupled with liquid chromatography) technology, in combination with the SILAC (Stable Isotope Labeling by Amino acids in Cell culture) method, has enabled us to make relative quantitation at the proteome level. The recent report by Blagoev et al. (Nat. Biotechnol., 22, 1139–1145, 2004) indicated that this method was also applicable for the time-course analysis of cellular signaling events. Relative quantitation can easily be performed by calculating the ratio of peak intensities corresponding to differentially labeled peptides in the MS spectrum. As currently available software requires some GUI applications and is time-consuming, it is not suitable for processing large-scale proteome data.
Results
To resolve this difficulty, we developed an algorithm that automatically detects the peaks in each spectrum. Using this algorithm, we developed a software tool named AYUMS that automatically identifies the peaks corresponding to differentially labeled peptides, compares these peaks, calculates each of the peak ratios in mixed samples, and integrates them into one data sheet. This software has dramatically reduced the time required to generate the final report.
Conclusion
AYUMS is a useful software tool for comprehensive quantitation of the proteome data generated by LC-MS/MS analysis. This software was developed using Java and runs on Linux, Windows, and Mac OS X. Please contact ayums@ims.u-tokyo.ac.jp if you are interested in the application. The project web page is http://www.csml.org/ayums/.
Background
The LC-MS/MS system is one of the most frequently used instruments for shotgun protein identification [1–6]. Protein identification by LC-MS/MS analysis consists mainly of the following five steps: (i) The samples are prepared from protein mixtures by peptide fragmentation with a protease, e.g., trypsin. (ii) In the LC column, the digested peptides are separated according to their hydrophobicity and/or polarity. (iii) In the survey scan (MS-1) mode, the peptides eluted from the LC system are continuously introduced into the mass spectrometer by electrospray ionization (ESI). (iv) The detector in the MS-1 mode separates peptides according to the mass/charge ratio (m/z) and selects the peaks with high intensity. (v) In the MS/MS (MS-2) mode, the selected peptides are separated from other components and randomly fragmented by physical impact. The detector integrates the intensity of each fragment, leading to the generation of MS/MS spectra.
Recent development of quantitative proteomics technology has made it possible to perform quantitative analysis of large-scale proteome data generated using the LC-MS/MS system. SILAC (Stable Isotope Labeling by Amino acids in Cell culture) is one of the most effective methods for comparative analysis of the expression status of proteins among samples [7–10], including time-course analysis [11]. The SILAC method has undergone some modifications. One of the well-modified SILAC methods is as follows: (i) Target cells are incubated in three types of media, namely, media containing (1) natural arginine, (2) arginine containing stable isotope of 13C, or (3) arginine with two types of stable isotopes, 13C and 15N. (ii) The samples prepared from differentially labeled cells are mixed in equal proportions and introduced into the LC-MS/MS system. (iii) The peak derived from the same amino acid sequence is shifted in proportion to the difference of the number of neutrons between the samples. Relative quantitation can be performed by comparing the peak intensities of differentially labeled peptides [11].
The above method is widely used for describing various biological events [10–12]. For example, Blagoev et al. reported the global quantitative dynamics of phosphotyrosine-based signaling events by measuring the fold activation of related proteins at different time points [11].
Several types of software, e.g., SEQUEST [13], MOWSE [14], Mascot [15], ProteinProspector [16], and ProFound [17], have been developed for protein identification based on MS or MS/MS data. These software tools deduce a corresponding protein/peptide sequence from the measured data and generate a report with additional information, e.g. reliability score, gene ID, and modification if any. For quantitation, MZmine version 0.60 was developed for differential analyses of the LC/MS profile data [18]. Although this software uses a GUI interface with a powerful batch-processing function, its application is restricted to the analyses of LC/MS data. For further analyses using LC-MS/MS in combination with the SILAC method, MSQuant [19] has been developed. MSQuant has a GUI interface and runs on Windows OS. However, this software is not in stable operation and requires a huge memory (e.g., 2 GB) to run.
In the present study, we have developed a completely automatic console-based software tool that is highly customized for LC-MS/MS proteome data obtained by the SILAC method. Here we report a new algorithm for peak detection, details of the data analysis pipeline, and a new platform-independent open source software, AYUMS, developed using this algorithm. Furthermore, we compare the results obtained by manual operation with those obtained using this software and discuss the respective performances.
Implementation
AYUMS consists of a series of steps for processing LC-MS/MS data. The scope of this software is focused on data processing for extracting quantitative information from the raw data. Therefore, other tools should be used for the statistical analyses based on the information produced by AYUMS. This software is implemented as a stand-alone Java application and requires JRE 1.4.2 or higher. In contrast to MSQuant (which runs only on Windows), AYUMS is platform-independent, i.e., it runs on Windows, Unix, and Mac OS X. In addition, the generation of the final report is completely automatic.
Software design
Our aim was to develop a software tool that automatically executes the calculation of the peak ratios of differentially labeled peptides analyzed by LC-MS/MS. To achieve this, we adopted a console-based user interface (CUI). AYUMS requires two input files – an LC-MS/MS raw data file and a database search result file containing the information on the identified peptides/proteins. AYUMS generates an output report in a comma-separated value (CSV) format. The flow chart of AYUMS is shown in Figure 1 and the contents of the flow chart are described in the following sections.
Input data style and conversion of the raw data file
In the first stage (Stage 1 in Figure 1), AYUMS requires two files, namely, (i) a Mascot HTML file and (ii) a binary file in our original format (ayums format). For generating the Mascot HTML file, a peak list file is first prepared from the raw MS/MS data file using ProteinLynx (Micromass, UK). This peak list is searched against the protein database using Mascot (Matrix Science, UK) and the output of the database search is saved as an HTML file. The binary file is generated by the following two steps: first, the MassLynx raw data are converted to ASCII style data using Databridge in the MassLynx package (the format is shown in Figure 2); subsequently, this ASCII data file is converted to the ayums format using the conversion functions in AYUMS. Using a Pentium 4 (3.0 GHz) processor, the total time required for the conversion from the raw data to the ASCII style by Databridge is 30 min to 1 h, and the time from the ASCII style to the ayums format by AYUMS is 3 to 6 h.
Parsing of Mascot HTML
The Mascot HTML file mainly comprises a list of inferred proteins and their peptides along with the information on the observed molecular weight, the calculated molecular weight, the difference between these two weights, probability-based Mowse score, p-value of the score, rank of the matched ion, peptide sequence, and MS/MS spectrum. In Stage 2, the Mascot HTML file is parsed to make these data available in AYUMS. The CyberNeko Java library developed by Andy Clark is used as an HTML parser [20]. If the XML format is implemented for the output of Mascot, an XML parser library will also be useful.
Selection of reliable proteins and their peptides
In Stage 3, every matched protein and the list of identified peptides under the defined conditions are extracted from the parsed results of the Mascot HTML data. The criteria for data extraction are as follows: (i) select protein/peptides with a Mascot score higher than a threshold value, (ii) select peptides in higher ranks than a threshold value. The default condition in AYUMS is set to select all the peptides with a score higher than 25 in the top rank.
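The default selection rule described above (score higher than 25, top rank only) can be sketched as a simple filter. This is an illustrative sketch only: the class names `PeptideHit` and `PeptideFilter` and their fields are hypothetical and are not taken from the AYUMS source. It also uses generics and the enhanced for loop, which assume a newer Java version than the JRE 1.4.2 named in the paper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical holder for one peptide hit parsed from a Mascot report.
class PeptideHit {
    final String sequence;
    final double score; // Mascot score of the match
    final int rank;     // rank of the match (1 = top rank)

    PeptideHit(String sequence, double score, int rank) {
        this.sequence = sequence;
        this.score = score;
        this.rank = rank;
    }
}

class PeptideFilter {
    // Keeps hits whose score is strictly higher than minScore
    // and whose rank is maxRank or better.
    static List<PeptideHit> select(List<PeptideHit> hits, double minScore, int maxRank) {
        List<PeptideHit> kept = new ArrayList<PeptideHit>();
        for (PeptideHit h : hits) {
            if (h.score > minScore && h.rank <= maxRank) {
                kept.add(h);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<PeptideHit> hits = Arrays.asList(
            new PeptideHit("PEPTIDER", 40.0, 1),
            new PeptideHit("SAMPLER", 20.0, 1),
            new PeptideHit("ANOTHERR", 50.0, 2));
        // Defaults described in the text: score > 25, top rank only.
        for (PeptideHit h : select(hits, 25.0, 1)) {
            System.out.println(h.sequence);
        }
    }
}
```

Passing the two thresholds as parameters mirrors the text's statement that both criteria are configurable.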
Peak detection and computation
In Stage 4, the peaks corresponding to the selected peptides are searched from the raw data and the peak ratios of the differentially labeled peptides are calculated.
The following five steps are applied for each selected peptide.
Step 1
Based on the Mascot data of the selected peptide, the retention time at LC and the m/z value of the peptide are searched from the ayums format of MS-2.
Step 2
According to the information on the retention time obtained in Step 1, the nearest time point is searched from the ayums format of MS-1, leading to the acquisition of the spectrum corresponding to the target peptide.
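The nearest-time-point lookup in Step 2 could be done with a binary search over the MS-1 retention times. The sketch below assumes the retention times are available as an ascending array of doubles; the class and method names are hypothetical, and the paper does not state how AYUMS performs this search.

```java
import java.util.Arrays;

class NearestScan {
    // Returns the index of the entry in sortedTimes closest to target.
    // sortedTimes must be non-empty and sorted in ascending order.
    static int nearestIndex(double[] sortedTimes, double target) {
        int pos = Arrays.binarySearch(sortedTimes, target);
        if (pos >= 0) {
            return pos; // exact match
        }
        int insertion = -pos - 1; // index of the first element greater than target
        if (insertion == 0) {
            return 0;
        }
        if (insertion == sortedTimes.length) {
            return sortedTimes.length - 1;
        }
        double before = target - sortedTimes[insertion - 1];
        double after = sortedTimes[insertion] - target;
        // Ties go to the earlier scan.
        return before <= after ? insertion - 1 : insertion;
    }

    public static void main(String[] args) {
        double[] ms1Times = {10.0, 12.0, 14.0, 16.0};
        System.out.println(nearestIndex(ms1Times, 12.9));
    }
}
```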
Step 3
The spectra around this time point are sequentially searched. A specific algorithm, the details of which are described below, calculates a score for each spectrum and selects the best spectrum.
The spectrum consists of a set of peaks with each individual m/z value and intensity. All the intensities within a certain range of m/z value (default 0.1) from the target peak are integrated. Each peptide is differentially displayed in three distinct forms that are derived from three types of stably labeled arginine (12C14N, 13C14N, and 13C15N). According to the information in the Mascot result, the identified peptide form and its differentially labeled ones are specified in the spectrum based on the principle that the differences of molecular weight between 12C14N - 13C14N and 13C14N - 13C15N are 6 Da and 4 Da, respectively (Figure 3).
In addition, as proteins/peptides are made of some natural isotopes, each peak is accompanied by sub-peaks which shift 1 Da and 2 Da in the spectrum. The intensities of these peaks are all integrated as the total quantity of the target peptide.
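The integration of a target peak together with its two isotope sub-peaks can be sketched as follows. The sketch assumes a spectrum is held as two parallel arrays (m/z and intensity), and that the sub-peak spacing is the neutron mass used in the paper's algorithm (1.008665) divided by the charge. The class and method names are hypothetical.

```java
class PeakIntegrator {
    // Mass spacing between isotope peaks, as used in the paper's algorithm.
    static final double NEUTRON = 1.008665;

    // Sums the intensities of all points whose m/z lies within +/- tol of center.
    static double sumWindow(double[] mz, double[] intensity, double center, double tol) {
        double sum = 0.0;
        for (int i = 0; i < mz.length; i++) {
            if (Math.abs(mz[i] - center) <= tol) {
                sum += intensity[i];
            }
        }
        return sum;
    }

    // Integrates the target peak plus the sub-peaks located one and two
    // isotope spacings above it (spacing = NEUTRON / charge on the m/z axis).
    static double integrate(double[] mz, double[] intensity,
                            double target, double tol, int charge) {
        double total = 0.0;
        for (int k = 0; k <= 2; k++) {
            total += sumWindow(mz, intensity, target + k * NEUTRON / charge, tol);
        }
        return total;
    }

    public static void main(String[] args) {
        double[] mz = {500.00, 500.05, 500.504, 501.01, 503.0};
        double[] intensity = {10, 5, 8, 4, 100};
        // Default tolerance from the text: 0.1; doubly charged peptide.
        System.out.println(integrate(mz, intensity, 500.0, 0.1, 2));
    }
}
```

Dividing the spacing by the charge matters because a 1 Da mass shift appears as a 1/c shift on the m/z axis for a peptide of charge c.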
Step 4
The spectra adjacent to the best spectrum are recursively selected as long as the score ratio of the investigated spectrum to the best one is higher than a constant value (default 0.8), which we term the acceptable ratio. Based on the data of the acceptable spectra, the intensities for three types of differentially labeled peptides are independently integrated.
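The selection of acceptable spectra around the best one can be illustrated with the sketch below. It assumes the per-spectrum scores are already available in an array indexed by scan number, and it bounds the walk in each direction by a maximum number of steps (the algorithm listing uses r4 = 10 for this). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SpectrumWindow {
    // Starting from the best-scoring scan, walks forward and then backward,
    // keeping each neighbour while its score is at least
    // acceptableRatio * (best score). The walk stops at the first rejected
    // scan, after maxSteps scans, or at the array boundary.
    static List<Integer> acceptableScans(double[] scores, int best,
                                         double acceptableRatio, int maxSteps) {
        List<Integer> kept = new ArrayList<Integer>();
        kept.add(best);
        double floor = scores[best] * acceptableRatio;
        for (int m = best + 1; m <= best + maxSteps && m < scores.length; m++) {
            if (scores[m] < floor) {
                break;
            }
            kept.add(m);
        }
        for (int m = best - 1; m >= best - maxSteps && m >= 0; m--) {
            if (scores[m] < floor) {
                break;
            }
            kept.add(m);
        }
        Collections.sort(kept);
        return kept;
    }

    public static void main(String[] args) {
        double[] scores = {0.1, 0.5, 0.45, 0.6, 0.5, 0.3, 0.55};
        // Default acceptable ratio from the text: 0.8.
        System.out.println(acceptableScans(scores, 3, 0.8, 10));
    }
}
```

Because the walk breaks at the first rejected neighbour, a later scan that again exceeds the threshold is not included, so the selected spectra always form one contiguous run.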
Step 5
Based on the result in Step 4, the average ratios of 13C14N and 13C15N to 12C14N and their standard deviations are calculated.
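The averaging in Step 5 amounts to a mean and a standard deviation over a set of intensity ratios. The sketch below is one way to compute them. It assumes the sample standard deviation (dividing by n - 1), which the paper does not specify, and the class and method names are hypothetical.

```java
class RatioStats {
    // Element-wise ratio labeled[i] / light[i]; both arrays must have equal length.
    static double[] ratios(double[] labeled, double[] light) {
        double[] out = new double[labeled.length];
        for (int i = 0; i < labeled.length; i++) {
            out[i] = labeled[i] / light[i];
        }
        return out;
    }

    // Arithmetic mean of a non-empty array.
    static double mean(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x;
        }
        return sum / v.length;
    }

    // Sample standard deviation (divides by n - 1, an assumption);
    // returns 0 when fewer than two values are given.
    static double stdDev(double[] v) {
        if (v.length < 2) {
            return 0.0;
        }
        double m = mean(v);
        double ss = 0.0;
        for (double x : v) {
            ss += (x - m) * (x - m);
        }
        return Math.sqrt(ss / (v.length - 1));
    }

    public static void main(String[] args) {
        double[] light = {10, 20, 40};  // 12C14N intensities
        double[] medium = {20, 40, 40}; // 13C14N intensities
        double[] r = ratios(medium, light);
        System.out.println(mean(r) + " +/- " + stdDev(r));
    }
}
```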
Algorithm
The procedure for Step 1 to Step 5 is described in the following algorithm.
n := 1.008665
r := 0.1
r2 := 5.000
r3 := 3
r4 := 10
r5 := -0.2
for s ∈ S: S is the set of proteins
    for {(f_i, n_i, c_i) | 0 ≤ i ≤ N} ∈ F(s): F is a function from a protein to the fragments of the protein, the scan number of the MS/MS experiment, and the charge of each fragment.
        (r_ms/ms, mz_ms) := R_ms/ms(n_i): R_ms/ms is a function from a scan number of the MS/MS experiment to the MS/MS retention time and m/z value of the MS experiment; these can be obtained from the raw data.
        (r_ms, n_ms) := R_ms(r_ms/ms): R_ms is a function from an MS/MS retention time to the nearest MS retention time and its scan number.
        emax := 0, mmax := 0, tmax := ()
        for {m | n_ms - r3 ≤ m ≤ n_ms + r3}
            ((I_12C14N,m, I_13C14N,m, I_13C15N,m), L_rate,m) := sub(m, mz_ms, c_i, f_i): calculate the total intensities of a peak and its ratio in the spectrum.
            if emax < L_rate,m
                emax := L_rate,m, mmax := m
                tmax := ((I_12C14N,m, I_13C14N,m, I_13C15N,m), L_rate,m)
            end
        end
        T := {tmax}
        for {m | mmax + 1 ≤ m ≤ mmax + r4}
            t := ((I_12C14N,m, I_13C14N,m, I_13C15N,m), L_rate,m) := sub(m, mz_ms, c_i, f_i)
            if emax × (1 + r5) ≤ L_rate,m
                add t to T
            else
                break
            end
        end
        for {m | 1 ≤ m ≤ r4}
            t := ((I_12C14N,m, I_13C14N,m, I_13C15N,m), L_rate,m) := sub(mmax - m, mz_ms, c_i, f_i)
            if emax × (1 + r5) ≤ L_rate,m
                add t to T
            else
                break
            end
        end
    end
    ratio_13C14N,i: is the ratio of the amount of the wild type to the 13C14N form.
    ratio_13C15N,i: is the ratio of the amount of the wild type to the 13C15N form.
    sd_13C14N := Standard deviation of {ratio_13C14N,i | 0 ≤ i ≤ N}
    sd_13C15N := Standard deviation of {ratio_13C15N,i | 0 ≤ i ≤ N}
end
function sub(n_ms, mz_ms, c, f)
    L = {(t_m/z,j, p_j) | 0 ≤ j ≤ M} := P(n_ms): P is a function from an MS scan number to the set of m/z and its intensity values. This set can be searched from the raw data.
    R := the number of arginine in f
    if f contains 13C and does not contain 15N
        mz_13C15N := mz_ms + 4nR/c
        mz_12C14N := mz_ms - 6nR/c
        mz_13C14N := mz_ms
    else if f contains 13C and 15N
        mz_13C14N := mz_ms - 4nR/c
        mz_12C14N := mz_ms - 10nR/c
        mz_13C15N := mz_ms
    end
    I_12C14N := peakIntensitySet(mz_12C14N, L, r, c)
    I_13C14N := peakIntensitySet(mz_13C14N, L, r, c)
    I_13C15N := peakIntensitySet(mz_13C15N, L, r, c)
    L_total := select all (t, p) ∈ L with [mz_12C14N - r2 ≤ t ≤ mz_13C15N + r2]
    return ((I_12C14N, I_13C14N, I_13C15N), L_rate): L_rate is the ratio of the summed peak intensities to the total intensity in L_total.
end
function peakIntensitySet(mz, L, r, c)
    L' := select all (t, p) ∈ L with [mz - r ≤ t ≤ mz + r]
    L'' := select all (t, p) ∈ L with [mz - r + n/c ≤ t ≤ mz + r + n/c]
    L''' := select all (t, p) ∈ L with [mz - r + 2n/c ≤ t ≤ mz + r + 2n/c]
    return the sum of p over all (t, p) ∈ L' ∪ L'' ∪ L'''
end
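The m/z bookkeeping inside the function sub can be made concrete with a short sketch. Given the m/z of the identified form, its label, the number of arginine residues, and the charge, it returns the expected m/z of all three forms using the 6 Da and 10 Da offsets and the constant n = 1.008665 from the listing. The `LIGHT` branch (an unlabeled identified peptide) is an assumption added here for completeness, since the listing spells out only the two labeled cases. The class, enum, and method names are hypothetical.

```java
class SilacMz {
    // Neutron mass constant n from the algorithm listing.
    static final double NEUTRON = 1.008665;

    // LIGHT = 12C14N, MEDIUM = 13C14N, HEAVY = 13C15N arginine.
    enum Label { LIGHT, MEDIUM, HEAVY }

    // Returns {mzLight, mzMedium, mzHeavy} for a peptide observed at mzObserved
    // carrying the given label, with the given arginine count and charge.
    static double[] formMz(double mzObserved, Label label, int arginines, int charge) {
        double unit = NEUTRON * arginines / charge; // shift per extra neutron
        double light;
        switch (label) {
            case MEDIUM:
                light = mzObserved - 6 * unit;
                break;
            case HEAVY:
                light = mzObserved - 10 * unit;
                break;
            default:
                // Assumed case: the identified peptide is the unlabeled form.
                light = mzObserved;
                break;
        }
        return new double[] { light, light + 6 * unit, light + 10 * unit };
    }

    public static void main(String[] args) {
        double[] mz = formMz(500.0, Label.MEDIUM, 1, 2);
        System.out.println(mz[0] + " " + mz[1] + " " + mz[2]);
    }
}
```

Computing the light m/z first and deriving the other two from it keeps the three values consistent regardless of which form was identified.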
Results output
In Stage 4, AYUMS generates a report in the CSV file format, as shown in Figure 4. The contents of the report are also described in the legend for Figure 4.
Results
Comparison of the machine operation with the manual operation
In order to evaluate the performance of the automatic calculation by AYUMS, we used three sets of time-series data on the phosphotyrosine-related proteome. A431 cells differentially labeled with stable isotopes of arginine were stimulated with epidermal growth factor (EGF) for different time periods, followed by affinity-purification of signaling molecules with anti-phosphotyrosine antibodies. After direct digestion of the proteins, protein identification and quantification were performed by nanoLC-MS/MS analysis (nanoLC: Dina-2A [KYA Technologies]; tandem mass spectrometer: Q-Tof-2 [Micromass]). Figure 5 shows the activation profile of phosphorylated proteins with the top six Mascot scores (AHNAK nucleoprotein, EGFR, catenin, villin 2, alpha 1 type XVII collagen, and junction plakoglobin). Figures 5(a) and 5(b) show the results obtained by manual operation and by AYUMS, respectively. From the experimental data, 100 proteins were detected by database search against the RefSeq human protein database (NCBI). In the pre-process, our algorithm removed 62 proteins with a Mascot score less than the threshold (default: 25). The remaining 38 phosphorylated proteins were then quantified by manual operation as well as by AYUMS. As shown in Figure 5(c), the results obtained by these two methods showed good correlation (R = 0.890).
Although the results for some proteins did not correlate well (for example, the value for villin 2 obtained by AYUMS is lower than that obtained by manual operation), the shapes of the activation change between the two methods matched each other in most cases. It should be noted that AYUMS enabled us to eliminate the necessity for manual operation. In other words, reliable quantitation results were obtained in a high-throughput fashion that had never been achieved previously.
The poor correlation for some proteins was mainly due to the existence of noise peaks. The background noise has a substantial influence on quantitation, especially in the case of low-abundance peaks. The contaminant noise derived from other peptides also affects the calculation. Although our instrument operates with high mass resolution (10,000 FWHM) and accuracy (50~100 ppm), it is difficult to distinguish the other peaks with adjacent m/z values. Although it is possible to remove unreliable data when performing analysis manually, our algorithm does not have a function to eliminate them efficiently. Some statistical methods are necessary to deal with this problem.
Discussion
Reduction of difficulties
The major contributions of this study are as follows: (i) drastic reduction in the manual work required to perform quantitation for large-scale proteome data and (ii) reproducibility of high-quality data that does not depend on the user. In the case of this study, it required 2–6 working days to create the activation profile of the phosphotyrosine-related proteome by manual operation. In contrast, AYUMS could automatically generate the final report within 6 hours using a single machine. It is also possible to perform quantitation in parallel for multiple experimental data. For example, if two machines are available, 3 hours are sufficient for the generation of the final result.
Once the ayums format file is created, the subsequent analysis can be completed within 15 minutes. Thus, it is possible to easily re-evaluate experimental data by changing various options such as the acceptable ratio in Step 4 of Stage 4 and the threshold of the Mascot score.
Future studies
Although completely automatic quantitation based on the LC-MS/MS data was realized using AYUMS 1.0, further development of this software is required at various points. First, the input of Stage 1 in AYUMS currently supports only Q-Tof type raw data; for more general use, it should also handle major data formats such as NetCDF. Second, it would be very helpful to generate the final result not only in the CSV file format but also in other major formats, such as mzXML [21], for better usability.
The present SILAC method enables us to compare only two or three samples in a single experiment. Relative quantitation of target proteins at multiple points such as in dynamics analysis requires a common standard point to normalize the results of separate experiments. AYUMS will need to support a function of statistical data processing of the normalized results for more precise quantitation.
Although AYUMS is customized for the SILAC method, it can also easily handle the data obtained by other labeling strategies such as isotope-coded affinity tags (ICAT) [22], isobaric tags for relative and absolute quantitation (iTRAQ) [23], and culture-derived isotope tags (CDIT) [24].
This software is open to public access; hence, any researcher can contribute to the development of its application.
Conclusion
AYUMS is a useful software tool for quantitative proteomics by LC-MS/MS technology in combination with the SILAC method. This software completely eliminates the need for manual work that has always been required previously. Besides, it enables us to obtain the final result considerably faster than by manual operation. Our evaluation of the output data by AYUMS indicated that it ranked comparably with the results calculated by an expert in proteomics.
Availability and requirements
- Project home page: http://www.csml.org/ayums/
- Operating system(s): Java platform independent
- Programming language: Java
- Other requirements: Java 1.4.2 or higher, CyberNeko HTML Parser 0.9.5 or higher
- License: AYUMS software is available from the authors at ayums@ims.u-tokyo.ac.jp.
- Any restrictions to use by non-academics: Need contract.
References
Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature 2003, 422: 198–207. 10.1038/nature01511
Patterson SD, Aebersold RH: Proteomics: the first decade and beyond. Nat Genet 2003, 33(Suppl):311–323. 10.1038/ng1106
Taylor SW, Fahy E, Ghosh SS: Global organellar proteomics. Trends Biotechnol 2003, 21: 82–88. 10.1016/S0167-7799(02)00037-9
Oyama M, Itagaki C, Hata H, Suzuki Y, Izumi T, Natsume T, Isobe T, Sugano S: Analysis of small human proteins reveals the translation of upstream open reading frames of mRNAs. Genome Res 2004, 14: 2048–2052. 10.1101/gr.2384604
Washburn MP, Wolters D, Yates JR 3rd: Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat Biotechnol 2001, 19: 242–247. 10.1038/85686
Kaji H, Saito H, Yamauchi Y, Shinkawa T, Taoka M, Hirabayashi J, Kasai K, Takahashi N, Isobe T: Lectin affinity capture, isotope-coded tagging and mass spectrometry to identify N-linked glycoproteins. Nat Biotechnol 2003, 21: 627–629. 10.1038/nbt829
Ong SE, Blagoev B, Kratchmarova I, Kristensen DB, Steen H, Pandey A, Mann M: Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol Cell Proteomics 2002, 1: 376–386. 10.1074/mcp.M200025-MCP200
de Godoy LM, Olsen JV, de Souza GA, Li G, Mortensen P, Mann M: Status of complete proteome analysis by mass spectrometry: SILAC labeled yeast as a model system. Genome Biol 2006, 7: R50. 10.1186/gb-2006-7-6-r50
Gruhler A, Schulze WX, Matthiesen R, Mann M, Jensen ON: Stable isotope labeling of Arabidopsis thaliana cells and quantitative proteomics by mass spectrometry. Mol Cell Proteomics 2005, 4: 1697–1709. 10.1074/mcp.M500190-MCP200
Foster LJ, Rudich A, Talior I, Patel N, Huang X, Furtado LM, Bilan PJ, Mann M, Klip A: Insulin-dependent interactions of proteins with GLUT4 revealed through stable isotope labeling by amino acids in cell culture (SILAC). J Proteome Res 2006, 5: 64–75. 10.1021/pr0502626
Blagoev B, Ong SE, Kratchmarova I, Mann M: Temporal analysis of phosphotyrosine-dependent signaling networks by quantitative proteomics. Nat Biotechnol 2004, 22: 1139–1145. 10.1038/nbt1005
Romijn EP, Christis C, Wieffer M, Gouw JW, Fullaondo A, van der Sluijs P, Braakman I, Heck AJ: Expression clustering reveals detailed co-expression patterns of functionally related proteins during B cell differentiation: a proteomic study using a combination of one-dimensional gel electrophoresis, LC-MS/MS, and stable isotope labeling by amino acids in cell culture (SILAC). Mol Cell Proteomics 2005, 4: 1297–1310. 10.1074/mcp.M500123-MCP200
Yates JR 3rd, McCormack AL, Link AJ, Schieltz D, Eng J, Hays L: Future prospects for the analysis of complex biological systems using micro-column liquid chromatography-electrospray tandem mass spectrometry. Analyst 1996, 121: 65R-76R. 10.1039/an996210065r
Pappin DJ, Hojrup P, Bleasby AJ: Rapid identification of proteins by peptide-mass fingerprinting. Curr Biol 1993, 3: 327–332. 10.1016/0960-9822(93)90195-T
Perkins DN, Pappin DJ, Creasy DM, Cottrell JS: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20: 3551–3567. 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2
Clauser KR, Baker P, Burlingame AL: Role of accurate mass measurement (+/- 10 ppm) in protein identification strategies employing MS or MS/MS and database searching. Anal Chem 1999, 71: 2871–2882. 10.1021/ac9810516
Zhang W, Chait BT: ProFound: An expert system for protein identification using mass spectrometric peptide mapping information. Anal Chem 2000, 72: 2482–2489. 10.1021/ac991363o
Katajamaa M, Oresic M: Processing methods for differential analysis of LC/MS profile data. BMC Bioinformatics 2005, 6: 179. 10.1186/1471-2105-6-179
Schulze WX, Mann M: A novel proteomic screen for peptide-protein interactions. J Biol Chem 2004, 279: 10756–10764. 10.1074/jbc.M309909200
CyberNeko HTML Parser [http://people.apache.org/~andyc/neko/doc/html/]
Pedrioli PG, Eng JK, Hubley R, Vogelzang M, Deutsch EW, Raught B, Pratt B, Nilsson E, Angeletti RH, Apweiler R, Cheung K, Costello CE, Hermjakob H, Huang S, Julian RK, Kapp E, McComb ME, Oliver SG, Omenn G, Paton NW, Simpson R, Smith R, Taylor CF, Zhu W, Aebersold R: A common open representation of mass spectrometry data and its application to proteomics research. Nat Biotechnol 2004, 22: 1459–1466. 10.1038/nbt1031
Gygi SP, Rist B, Gerber SA, Turecek F, Gelb MH, Aebersold R: Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol 1999, 17: 994–999. 10.1038/13690
Ross PL, Huang YN, Marchese JN, Williamson B, Parker K, Hattan S, Khainovski N, Pillai S, Dey S, Daniels S, Purkayastha S, Juhasz P, Martin S, Bartlet-Jones M, He F, Jacobson A, Pappin D: Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol Cell Proteomics 2004, 3: 1154–1169. 10.1074/mcp.M400129-MCP200
Ishihama Y, Sato T, Tabata T, Miyamoto N, Sagane K, Nagasu T, Oda Y: Quantitative mouse brain proteomics using culture-derived isotope tags as internal standards. Nat Biotechnol 2005, 23: 617–621. 10.1038/nbt1086
Acknowledgements
We are grateful to E. Nakajima for critical reading of the manuscript. This work was supported by the Japan Science and Technology Agency.
Authors' contributions
AS developed the new algorithms for peak recognition, operated the software, and wrote the manuscript. MN developed the new algorithms for peak recognition and helped to implement the algorithms, operate the software, and prepare the manuscript. MO initiated this study, provided knowledge about the structure of the input raw data, and wrote the manuscript. HK-H performed the experiment and helped to operate the software. KS provided knowledge about biochemistry. SS provided knowledge about proteomics technology. TY provided knowledge about signal transduction. SM supervised the dry study. All the authors read and approved the final manuscript.
Ayumu Saito, Masao Nagasaki, Masaaki Oyama contributed equally to this work.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Saito, A., Nagasaki, M., Oyama, M. et al. AYUMS: an algorithm for completely automatic quantitation based on LC-MS/MS proteome data and its application to the analysis of signal transduction. BMC Bioinformatics 8, 15 (2007). https://doi.org/10.1186/1471-2105-8-15
DOI: https://doi.org/10.1186/1471-2105-8-15