A database application for pre-processing, storage and comparison of mass spectra derived from patients and controls
© Titulaer et al; licensee BioMed Central Ltd. 2006
Received: 13 April 2006
Accepted: 05 September 2006
Published: 05 September 2006
Statistical comparison of peptide profiles in biomarker discovery requires fast, user-friendly software for high throughput data analysis. Important features are flexibility in changing input variables and statistical analysis of peptides that are differentially expressed between patient and control groups. In addition, integration of the mass spectrometry data with the results of other experiments, such as microarray analysis, and with information from other databases requires central storage of the profile matrix, where protein IDs can be added to peptide masses of interest.
A new database application is presented to detect and identify significantly differentially expressed peptides in peptide profiles obtained from body fluids of patient and control groups. The presented modular software is capable of central storage of mass spectra and enables fast analysis. The software architecture consists of four pillars: 1) a Graphical User Interface written in Java, 2) a MySQL database, which contains all metadata, such as experiment numbers and sample codes, 3) an FTP (File Transfer Protocol) server to store all raw mass spectrometry files and processed data, and 4) the software package R, which is used for modular statistical calculations, such as the Wilcoxon-Mann-Whitney rank sum test. Statistical analysis by the Wilcoxon-Mann-Whitney test in R demonstrates that the peptide profiles of two patient groups, 1) breast cancer patients with leptomeningeal metastases and 2) prostate cancer patients in end stage disease, can be distinguished from those of control groups.
The database application can distinguish patient Matrix Assisted Laser Desorption Ionization time-of-flight (MALDI-TOF) peptide profiles from those of control groups using large datasets. The modular architecture makes it possible to adapt the application to also handle large datasets from MS/MS and Fourier Transform Ion Cyclotron Resonance (FT-ICR) mass spectrometry experiments. It is expected that the higher resolution and mass accuracy of FT-ICR mass spectrometry prevent the clustering of peaks of different peptides and allow the identification of differentially expressed proteins from the peptide profiles.
In mass spectrometry (MS), analysis of mass spectra is possible with various software packages. In general, these software applications work well for the analysis of individual spectra, but lack the ability to compare very large numbers of spectra and to attribute differences in (peptide) profile masses to certain groups, such as patient and control groups. Therefore, it is necessary to have fast, user-friendly software for high throughput data preprocessing, with flexibility in changing input variables and statistical tools to analyze the peptides that are significantly differentially expressed between the patient and control groups. Statistical calculations are performed within seconds to at most several hours. To the best of our knowledge, the only open source project that is capable of peptide profiling with raw MS fid (free induction decay) files (Bruker Daltonics, Germany) is the RProteomics 3-tier architecture of the Cancer Biomedical Informatics Grid, presented in a concurrent versions system (cabigcvs.nci.nih.gov). In the RProteomics project, the main development language is R and the application has a web interface. This paper describes an application where MS data preprocessing is expanded with a kind of Laboratory Information Management System (LIMS). It requires no grid architecture, can even be installed on a stand-alone computer, and, due to local file interfaces, can easily be integrated with commercial statistical software packages and visualization applications, such as Spotfire™ and Omniviz™. The presented software architecture is capable of central storage of mass spectra and analysis results. A central database holds all metadata. Metadata consist of the origin of the measured samples, the experiments performed on different mass spectrometers and the allocation of samples to different groups. Metadata can also link the experimental results to clinical information.
Information from the database can be retrieved with Structured Query Language (SQL) and can be linked to other databases on common keys, such as patient code. In this study, the application is built in fast Java code, which provides an excellent GUI, and statistical R routines are called when needed. In addition, the protein origin of the significant peptide masses can be identified by comparing the centrally stored peptide masses of interest with those calculated from the human mass spectrometry protein sequence database (for example MSDB) or by mass spectrometry assisted sequencing. Both identification techniques use the Mascot™ search engine. The platform independent software architecture is tested on two sets of data: 1) mass spectrometry (MS) files of cerebrospinal fluid (CSF) samples from patients with breast cancer, patients with breast cancer and leptomeningeal metastasis (LM), and a control group; and 2) MS files of serum samples from patients with prostate cancer in end stage disease and a control group.
CSF samples of breast cancer patients
Quality report. An example of a quality report, generated by the software, when a profile matrix is created.
experimentname : cerebrospinal fluid tryptic peptide profiles to diagnose leptomeningeal metastasis in breast cancer patients
Find maxima within distance (Da) : 0.50
Quantile threshold peak finding : 0.98
Combine peaks within distance (Da) : 0.50
Noisy spectra contain peak numbers > : 450
Maximum mass spectra per sample : 4
threshold binary table 0 < 1 > = : 2
Minimum mass (Da) : 800
Maximum mass (Da) : 4000
Calibration : yes
number of masses in matrix : 1949
group : control group without breast cancer(group1_control)
total number of samples : 45
number of samples in matrix : 32
number of samples with too low number of replicates : 12
number of samples not in matrix that because all spectra cannot not be calibrated or are all too noisy : 1
(total number of 6 sample(s) with at least one noisy spectrum)
group : breast cancer without leptomeningeal metastasis(group2_breast_cancer)
total number of samples : 52
number of samples in matrix : 39
number of samples with too low number of replicates : 13
number of samples not in matrix that because all spectra cannot not be calibrated or are all too noisy : 0
(total number of 5 sample(s) with at least one noisy spectrum)
group : breast cancer with leptomeningeal metastasis(group3_leptomeningeal_metastasis)
total number of samples : 54
number of samples in matrix : 40
number of samples with too low number of replicates : 12
number of samples not in matrix that because all spectra cannot not be calibrated or are all too noisy : 2
(total number of 8 sample(s) with at least one noisy spectrum)
Serum samples of prostate cancer patients
About 7 ml of blood from 27 patients and 27 controls is collected in clotting tubes and stored at room temperature for 2 h. Subsequently, the tubes are centrifuged at 1000 rpm for 10 minutes, and the supernatant is collected and stored at -80°C. The serum is tryptically digested by overnight incubation at 37°C at a 1:10 ratio with a Promega Trypsin Gold stock solution at a concentration of 100 μg/ml. In total, 5 μl of the digested sample is bound to Magnetic Beads MB-HIC C18. The beads are eluted with 10 μl of 50% acetonitrile in Milli-Q water. An amount of 0.5 μl of the eluted fraction is spotted 4 times on the anchor chip and measured on an Ultraflex™ MALDI-TOF (Bruker Daltonics, Germany) in reflectron mode, which gives 4 replicate spectra for each sample. The mass spectra are internally calibrated on at least 4 of the 5 omnipresent peptide masses, which are different from those of the CSF experiment. A somewhat higher noise threshold of 600 peaks in a spectrum is chosen than for the CSF samples.
Software architecture, packages and interfacing
GUI components and functions
A small storage size of the files on the FTP server is guaranteed by the fid format of the MS spectra, a byte array of 92000 channel intensities. The TOF, time, can be calculated from the MS channel number, i, in the fid files by
$$\mathrm{time}_i = \mathrm{DELAY} + (i \cdot \mathrm{DW}), \qquad i = 1, 2, \ldots, 92000 \tag{1}$$
The values of the constants DW (dwell time) and DELAY are stored in the acqus and acqu files, which are also transported to the FTP server. Other important values are those of the ML1, ML2 and ML3 calibration constants in the acqus files, which are used to calculate the peptide masses from the TOF. Theoretically, the square root of the mass over charge, $\sqrt{m/z}$, is proportional to the TOF, time.
$$0 = A \left(\sqrt{m/z}\right)^2 + B \sqrt{m/z} + C(\mathrm{time}_i) \tag{2}$$
Therefore, the value of constant B is about 40,000 times larger than the value of constant A, where $A = \mathrm{ML}_3$, $B = \sqrt{10^{12}/\mathrm{ML}_1}$, and $C(\mathrm{time}_i) = \mathrm{ML}_2 - \mathrm{time}_i$. The mass over charge is
$$m/z = \left(\frac{-B + \sqrt{B^2 - 4 A \, C(\mathrm{time}_i)}}{2A}\right)^2 \tag{3}$$
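The conversion from channel number to m/z can be sketched in Java as follows. This is a minimal illustration and not the code of the application itself. The class name `TofCalibration` is hypothetical, and the definition B = sqrt(10^12 / ML1) is an assumption based on the convention commonly used for Bruker flex data. Because B is much larger than A, the sketch uses the algebraically equivalent root -2C / (B + sqrt(B^2 - 4AC)), which avoids subtracting two nearly equal numbers.

```java
/** Sketch: channel number -> time of flight -> m/z for one fid spectrum. */
final class TofCalibration {
    private final double delay; // DELAY constant from the acqus file
    private final double dw;    // DW (dwell time) from the acqus file
    private final double a;     // A = ML3
    private final double b;     // B = sqrt(1e12 / ML1): assumed convention
    private final double ml2;   // ML2, used in C(time) = ML2 - time

    TofCalibration(double delay, double dw, double ml1, double ml2, double ml3) {
        this.delay = delay;
        this.dw = dw;
        this.a = ml3;
        this.b = Math.sqrt(1.0e12 / ml1);
        this.ml2 = ml2;
    }

    /** Equation (1): time_i = DELAY + i * DW. */
    double time(int channel) {
        return delay + channel * dw;
    }

    /** Solves A*s^2 + B*s + C(time_i) = 0 for s = sqrt(m/z) and returns s^2. */
    double massOverCharge(int channel) {
        double c = ml2 - time(channel);
        double discriminant = b * b - 4.0 * a * c;
        if (discriminant < 0.0) {
            throw new IllegalArgumentException("no real solution for channel " + channel);
        }
        // Same root as (-B + sqrt(B^2 - 4AC)) / (2A), written in a form that
        // stays accurate when B >> A and also works when A is zero.
        double root = -2.0 * c / (b + Math.sqrt(discriminant));
        return root * root;
    }
}
```

In practice the five constructor arguments would be read from the acqus file of the spectrum; no real instrument values are shown here.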
A peak list consists of mass over charge (m/z), channel number i, and intensity. It is constructed from the data in the raw fid files. A histogram of the number of channels with a specific intensity can be constructed. The integral under the distribution curve equals the total number of 92000 instrument channels. From this distribution curve, the R quantile function calculates an intensity threshold below which a channel intensity is found with a probability of 98 %. The effect of changing the R quantile percentage between 97 and 99 % in the create matrix GUI (Figure 4) is examined. The MS peaks are expected to be in the channel numbers, i, with an intensity higher than this threshold, namely among the 1 to 3 % highest intensities, depending on the chosen quantile. The peak finding algorithm determines the highest channel intensity within a certain mass over charge (m/z) window, for example 0.5 Da on both sides. A second condition is that this local maximum intensity must be above the quantile threshold intensity. Noise spectra do not contain real peaks with high-intensity flanks. As a consequence, many noise peaks are above the quantile threshold. Peak lists with more peak masses than an arbitrary limit of 450 are discarded, because a large part of these peak positions are probably noise peaks.
Internal calibration is necessary to align all the spectra in the matrix. Several methods have been reported to align mass spectra datasets. The alignment algorithms of Wong et al. and Jeffries have in common that they use special reference masses or peaks between the spectra. Wong et al. have developed an algorithm written in C++ in which spectral data points are added or deleted in regions with a low intensity, in order to shift peaks. This algorithm has a slight effect on the shape of the peaks. However, the signals in MS are represented by peaks and not by the regions of minimal intensity. Jeffries compares peak lists generated from mass spectra. He uses R's smooth spline function to correct measured masses with the help of reference calibrant masses. A smooth spline function, fλ, is drawn through the ratio of measured over real mass on the y-axis against the measured mass of the calibrant peaks on the x-axis, which results in a factor close to 1. Division of the measured masses by the calculated function fλ corrects all data points by interpolation. Theoretically, a cubic spline function needs to pass through all of the calibrant data points. This results in a lot of curvature. A smooth spline is a compromise, in which the function may deviate from the calibrant data points within a certain limit, governed by a factor λ, which diminishes the amount of curvature. The amount of curvature is expressed by integrating the square of the second derivative of the spline function. Another alignment algorithm assumes no knowledge of peaks in common [20, 21]. This method considers the shape of the spectra and aims to minimize the phase differences between the spectra. This process is named dynamic time warping. It is, however, easier to calibrate the channel numbers of the MALDI-TOF equipment against known masses, since the square root of mass over charge is theoretically proportional to the time. This dependency can be fit with a polynomial function.
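The two peak finding conditions (a local maximum within an m/z window, and an intensity above the quantile threshold) can be illustrated with a short Java sketch. It is an illustration under stated assumptions rather than the implementation used in the application: the class name `QuantilePeakFinder` is hypothetical, the m/z values are assumed to be sorted in ascending order, and the quantile uses linear interpolation between order statistics, which is the default (type 7) of R's quantile function.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: quantile threshold plus local-maximum peak picking. */
final class QuantilePeakFinder {

    /** Sample quantile with linear interpolation (R's default, type 7). */
    static double quantile(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double h = (sorted.length - 1) * p;
        int lo = (int) Math.floor(h);
        int hi = Math.min(lo + 1, sorted.length - 1);
        return sorted[lo] + (h - lo) * (sorted[hi] - sorted[lo]);
    }

    /**
     * Returns the channel indices that hold the highest intensity within
     * +/- windowDa and whose intensity exceeds the quantile threshold.
     * The mz array must be sorted in ascending order.
     */
    static List<Integer> findPeaks(double[] mz, double[] intensity,
                                   double p, double windowDa) {
        double threshold = quantile(intensity, p);
        List<Integer> peaks = new ArrayList<>();
        for (int i = 0; i < mz.length; i++) {
            if (intensity[i] <= threshold) {
                continue; // condition 2: must be above the quantile threshold
            }
            boolean isMax = true;
            // Strict comparison on the left so that the leftmost channel
            // of a flat top is reported exactly once.
            for (int j = i - 1; j >= 0 && mz[i] - mz[j] <= windowDa && isMax; j--) {
                isMax = intensity[j] < intensity[i];
            }
            for (int j = i + 1; j < mz.length && mz[j] - mz[i] <= windowDa && isMax; j++) {
                isMax = intensity[j] <= intensity[i];
            }
            if (isMax) {
                peaks.add(i); // condition 1: local maximum within the window
            }
        }
        return peaks;
    }
}
```

A spectrum would then be rejected as noisy when the returned list is longer than the chosen limit, for example 450 peaks.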
The masses in the peak list are internally calibrated using at least 4 of the 5 omnipresent albumin masses. The channel numbers in the peak list whose corresponding masses lie closest, within a window of 0.5 Da, to one of the albumin masses are determined. Peak lists without the required number of albumin masses are discarded. The channel numbers, i, and the corresponding albumin masses, m/z, are fit with a second-degree polynomial function
$$\sqrt{m/z} = a_0 + a_1 i + a_2 i^2 \tag{4}$$
The coefficients $a_0$, $a_1$, and $a_2$ are calculated with R's linear model (lm) function, where $y = \sqrt{m/z}$, $x = i$, and a is the array of $a_0$, $a_1$, and $a_2$. For the model formula used here, coef returns the coefficients in the order intercept, I(x^2) term, x term.
ft3 <- lm(y ~ I(x^2) + x) (5)
a <- coef(ft3) (6)
All peptide masses in the peak list are recalculated, using these coefficients and the polynomial function.
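For readers who want to reproduce the fit outside R, the least-squares step can be sketched in Java. The sketch assumes that the fitted quantity is the square root of m/z as a function of the channel number; the class and method names are hypothetical and do not come from the application. The channel numbers are centered and scaled before the normal equations are solved, because raw channel numbers up to 92000 raised to the fourth power make the system poorly conditioned.

```java
/** Sketch: least-squares fit of y = a0 + a1*x + a2*x^2 (x = channel, y = sqrt(m/z)). */
final class QuadraticCalibration {

    /** Returns {a0, a1, a2} for the original, unscaled x values. */
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        if (n < 3 || y.length != n) {
            throw new IllegalArgumentException("need at least 3 paired points");
        }
        double mean = 0.0, min = x[0], max = x[0];
        for (double v : x) {
            mean += v / n;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double scale = (max > min) ? (max - min) : 1.0;
        // Augmented normal equations (X'X | X'y) for u = (x - mean) / scale.
        double[][] m = new double[3][4];
        for (int k = 0; k < n; k++) {
            double u = (x[k] - mean) / scale;
            double[] row = {1.0, u, u * u};
            for (int r = 0; r < 3; r++) {
                for (int c = 0; c < 3; c++) {
                    m[r][c] += row[r] * row[c];
                }
                m[r][3] += row[r] * y[k];
            }
        }
        double[] b = solve3(m);
        // Transform the coefficients back from u to x.
        double a2 = b[2] / (scale * scale);
        double a1 = b[1] / scale - 2.0 * a2 * mean;
        double a0 = b[0] - b[1] * mean / scale + a2 * mean * mean;
        return new double[] {a0, a1, a2};
    }

    /** Gaussian elimination with partial pivoting on a 3x4 augmented matrix. */
    private static double[] solve3(double[][] m) {
        for (int col = 0; col < 3; col++) {
            int pivot = col;
            for (int r = col + 1; r < 3; r++) {
                if (Math.abs(m[r][col]) > Math.abs(m[pivot][col])) {
                    pivot = r;
                }
            }
            double[] tmp = m[col];
            m[col] = m[pivot];
            m[pivot] = tmp;
            if (m[col][col] == 0.0) {
                throw new IllegalArgumentException("degenerate calibration points");
            }
            for (int r = col + 1; r < 3; r++) {
                double f = m[r][col] / m[col][col];
                for (int c = col; c < 4; c++) {
                    m[r][c] -= f * m[col][c];
                }
            }
        }
        double[] b = new double[3];
        for (int r = 2; r >= 0; r--) {
            double sum = m[r][3];
            for (int c = r + 1; c < 3; c++) {
                sum -= m[r][c] * b[c];
            }
            b[r] = sum / m[r][r];
        }
        return b;
    }

    /** Recalculates m/z for one channel from the fitted coefficients. */
    static double massOverCharge(double[] a, double channel) {
        double s = a[0] + a[1] * channel + a[2] * channel * channel;
        return s * s;
    }
}
```

With exactly three calibration points the polynomial passes through all of them; with four or five albumin masses the fit is a least-squares compromise, which is the behaviour of lm as well.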
The database application can clearly distinguish the MALDI-TOF peptide profiles of different patient and control groups. It can determine differences in the frequency and intensities of peptide masses in spectra from both groups. A strong feature of the architecture described here is that it can process different MS file formats, such as peak lists and MALDI-TOF and FT-ICR binary files from various manufacturers, in the same manner. More important are speed and memory usage on the client workstation. Peptide profile matrices have to be created in reasonable time. When dealing with large quantities of data, the Java application easily runs into out of memory errors with the default settings of the JVM. It is therefore very important to use LIMIT and OFFSET strategies in MySQL queries to fetch no more than a buffered amount of 5000 table records at a time when displaying them in the GUI. A specific MALDI-TOF MS matrix of 111 samples and 1949 masses (Figure 4) has 216339 matrix fields and a CSV file size of 444 Kbytes. Three matrices of this size (peptide mass occurrences, intensity, and binary) can be built simultaneously in the memory allocated to the Java Virtual Machine (JVM). However, a typical FTMS matrix with 374 samples and 10651 discriminated masses has 3983474 matrix fields, about 18 times as many, and a CSV file size of 7.9 Mbytes, also about 18 times larger. It is impossible to build three matrices of this size simultaneously in Java's memory space. These files have to be built in the user document root as a FileOutputStream and transported to the FTP server.
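The LIMIT and OFFSET strategy can be illustrated with a small helper that generates one query per page of 5000 records. This is a sketch only: the class name `PagedQuery` and the table and column names in the usage note are hypothetical and are not taken from the database design of the application.

```java
/** Sketch: fetch table records in pages of at most 5000 rows. */
final class PagedQuery {
    static final int PAGE_SIZE = 5000;

    /** Appends a LIMIT/OFFSET clause for the given zero-based page index. */
    static String pageSql(String baseQuery, int pageIndex) {
        if (pageIndex < 0) {
            throw new IllegalArgumentException("pageIndex must be >= 0");
        }
        long offset = (long) pageIndex * PAGE_SIZE;
        return baseQuery + " LIMIT " + PAGE_SIZE + " OFFSET " + offset;
    }

    /** Number of pages needed to show totalRows records. */
    static int pageCount(long totalRows) {
        return (int) ((totalRows + PAGE_SIZE - 1) / PAGE_SIZE);
    }
}
```

For example, `PagedQuery.pageSql("SELECT mass, intensity FROM peak ORDER BY mass", 2)` requests records 10001 to 15000. The base query should contain an ORDER BY clause, since the order of rows is otherwise not guaranteed to be the same from one page to the next.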
It is also possible to use the SJava package to set up an interface between Java and the statistical software package R. It can be used to invoke Java methods and create Java objects by R commands. This is, however, the opposite of our approach, where R is called from Java. Another approach is to access R through a TCP/IP (Transmission Control Protocol/Internet Protocol) connection, using the service Rserve™. A disadvantage of this method is that Rserve has to be started explicitly by the operating system, outside the Java application, before any R script can be run from Java. It would be possible to write additional Java classes for statistical routines, such as the Wilcoxon-Mann-Whitney test. Indeed, for this one test it would be more logical to add it directly to the Java code. However, the usage of R goes beyond just the Wilcoxon-Mann-Whitney test, which is not claimed to be the full analysis. The Wilcoxon-Mann-Whitney test is an example of a univariate test that is an important first step. In R, it is possible to switch to other univariate tests and, most importantly, to multivariate analysis, such as hierarchical clustering in two dimensions (where Spotfire™ fails with very big matrices). In addition, R can be used for the peak finding algorithms (quantile calculations, baseline and noise level determination, etc.), which has the advantage that these algorithms are well tested and optimized for speed. The architecture allows the analysis to be extended to clustering, the building of multivariate classifiers, etc. (techniques we have already used in our previous paper). This will be an important point to focus on in the future. A reason to implement a 2-tier architecture, with a thick client and a database server, is to have a more attractive Java GUI than the less advanced interface offered by JavaScript in a web browser. It is possible to monitor preprocessing of the MS spectra with a progress bar.
Another possibility is to convert instrument specific file types to the uniform mzXML file format and display spectra with a Java mzXML viewer. A 3-tier architecture with a presentation layer (web browser), business logic provided by an application server, and a database server is more difficult to implement. For example, the file interface of Java with the software package R is more difficult than in the 2-tier architecture. In the 2-tier situation, every user has his or her own file repository on the local machine. In a 3-tier architecture, special precautions have to be taken to prevent time-out errors and performance issues, for example by applying distributed computing in a grid. For example, an FT-ICR MS peptide profile matrix of 10651 discriminated masses and 374 samples with at least 3 peptide occurrences per mass has a size of 7887 Kbytes and takes at least 12 h to produce.
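Calling R from Java through a local file interface can be sketched as starting an R process on a script and waiting for it to finish. The sketch below is an assumption about how such a call could look and is not the code of the application. The class name `RScriptRunner` and the file names in the usage note are hypothetical, and the use of the `Rscript` front end with the `--vanilla` option presumes an R installation that provides it.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Sketch: run an R script from Java, exchanging data through local files. */
final class RScriptRunner {

    /** Builds the command line: executable, option, script, input file, output file. */
    static List<String> command(String rExecutable, File script,
                                File matrixCsv, File resultCsv) {
        return Arrays.asList(rExecutable, "--vanilla", script.getPath(),
                             matrixCsv.getPath(), resultCsv.getPath());
    }

    /** Starts the process, forwards its output to the console, returns the exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true);
        builder.redirectOutput(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();
        return process.waitFor();
    }
}
```

A call such as `RScriptRunner.run(RScriptRunner.command("Rscript", new File("wilcoxon.R"), new File("matrix.csv"), new File("pvalues.csv")))` would block until R has written its result file, after which the Java GUI can read the p-values. The R script would pick up the two file names with `commandArgs(trailingOnly = TRUE)`.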
More advanced techniques such as Fourier transform ion cyclotron resonance (FT-ICR) MS and offline nano LC-MALDI (Liquid Chromatography) in combination with FT-ICR measure accurate masses in the 0.5 to 1 ppm range. Furthermore, the higher resolution of FT-ICR MS prevents the clustering of peaks of different peptides. These techniques allow the identification of proteins from peptide masses by either peptide mapping or peptide sequencing. The database application can be adapted to handle the mass spectra of these experiments due to its modular architecture. The type of equipment, in combination with the type of imported spectra, will determine the handling of raw data, such as the calibration and peak finding algorithms. An extra Fast Fourier Transformation (FFT) step, which transforms the raw data of FT-ICR experiments from the time domain to the frequency domain, is at present under construction. The peptide masses can subsequently be calculated from the cyclotron frequency. It is also possible to apply a de-isotope algorithm to the peptide masses due to the higher resolution and mass accuracy of FT-ICR. Peak centroiding will be implemented, which calculates the real mass of the peak maximum, weighted by the intensity of the points surrounding the local maximum.
The database is stripped to its essentials and contains all the necessary fields for preprocessing, while most of the input parameters are stored in the matrix file names. For example, the database design does not contain an authentication database for encrypted password storage and management of user accounts. Other tables that are not included are the audit trail and action logging tables found in modern LIMS. Details of the database design are presented in the proteomics.txt create table script (added as a supplementary file). The database tables contain the necessary (on delete) table triggers, which ensure database integrity. The database design allows the comparison of large quantities of mass spectra. The table result offers the possibility to store retention times and to group sequential mass spectra from one sample in an LC-MS experiment. An improved peak finding algorithm based on signal to noise levels is under construction. An extra table with calculated peptide masses of expected proteins from MS-MS experiments can be added to the database, which will make a direct analysis of differentially expressed proteins possible.
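Peak centroiding as described above amounts to an intensity-weighted mean of the m/z values around a local maximum. The following Java sketch shows one possible form; the class name `PeakCentroid` and the `halfWidth` parameter (the number of points taken on each side of the maximum) are assumptions introduced for illustration, since the text does not specify how many surrounding points are used.

```java
/** Sketch: intensity-weighted centroid of a peak around its local maximum. */
final class PeakCentroid {

    /**
     * Returns the intensity-weighted mean m/z of the points from
     * maxIndex - halfWidth to maxIndex + halfWidth (clipped to the array).
     */
    static double centroid(double[] mz, double[] intensity,
                           int maxIndex, int halfWidth) {
        int from = Math.max(0, maxIndex - halfWidth);
        int to = Math.min(mz.length - 1, maxIndex + halfWidth);
        double weighted = 0.0;
        double total = 0.0;
        for (int i = from; i <= to; i++) {
            weighted += mz[i] * intensity[i];
            total += intensity[i];
        }
        if (total <= 0.0) {
            return mz[maxIndex]; // no signal: fall back to the maximum itself
        }
        return weighted / total;
    }
}
```

For a symmetric peak the centroid coincides with the m/z of the maximum; for an asymmetric peak it shifts towards the side with more intensity.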
The use of an ordinary FTP server in a university environment is a security risk that should not be underestimated. On the other hand, FTP is a standard that is accepted and widely accessible across every network and operating system. First of all, precautions have to be taken with the setup and configuration of an FTP server, as described e.g. by Ray Zadjmool. The architecture disables anonymous access. However, it is not possible to register user accounts, and the connection is made through one root account. Users and IP addresses have to be logged, as well as the success or failure of account logon events. Accounts should be locked after several failed login attempts. Access to the FTP directory should be regulated using access control list (ACL) restrictions across Windows NT File System (NTFS) permissions. Disk quotas should be enabled to limit the amount of disk space per user, to prevent the server from becoming a media file sharing place for hackers. IP address restrictions should be set equal to the range of hospital or university IP addresses. The user passwords must meet complexity requirements. However, plain FTP servers can only handle usernames and passwords in plain text, which can easily be intercepted by password sniffers. Sensitive data and login information can be encrypted using FTPS or SFTP, which addresses this weakness of plain FTP. FTPS (FTP over SSL) uses a Secure Sockets Layer (SSL) located between the FTP and Transmission Control Protocol (TCP) layers. FTPS combines the encryption capabilities of SSL with the advanced features of FTP. Unfortunately, Enterprise Distributed Technologies provides only a commercial Java library that supports FTPS and SFTP. An open source Java secure shell version 2 (SSH-2) library, jsch-0.1.28.jar, that supports SFTP (SSH File Transfer Protocol) is provided by JCraft.
In the Bonferroni approach to n independent tests, the overall chance β of making no error of type 1 is the product of the corresponding chances β' of the individual tests.
$$\beta = \beta'_1 \cdot \beta'_2 \cdot \beta'_3 \cdots \beta'_n = (1 - p')^n = 1 - p \tag{7}$$
According to the binomial theorem, for small values of p'
$$(1 - p')^n \approx 1 - n\,p' \tag{8}$$
These equations show that the overall p-value threshold, p, should be divided by n to obtain the significance level, p', of the individual tests.
In both the CSF and prostate cancer datasets some tests satisfy the Bonferroni multiple test approach, for example $1.1 \times 10^{-6} < 0.01/1949$ and $2.7 \times 10^{-7} < 0.01/1354$. The Bonferroni approach may not be ideally suited for this type of data, as the presence of individual peptide peaks may be correlated, since they can be isotopes of the same peptide or peptides from the same protein. Rather than lowering the p-value threshold in a Bonferroni approach, the complete p-value distribution (and a randomization method to check the expected distribution) is shown. The numbers are explicitly supplied, because the plot does not specify the exact p-values lower than 0.01.
A new software architecture is presented which can analyze high throughput MS data from MALDI-TOF MS measurements in an efficient way. Results of the analysis are stored in a centralized relational database and FTP server. Metadata of the experiment and samples can be stored as well, and can be used to link the results to clinical data or data from other types of experiments. The database application generates a matrix with the frequency of masses in replicate spectra from different samples, a binary table with the frequency of masses above a specific threshold, and a matrix with the mean intensity of the present peaks in the mass spectra replicates. The matrix, which is stored on the FTP server and in the local document directory, can be imported in statistical packages or in (commercial) analysis software such as Spotfire. Statistical analysis of two test datasets by the Wilcoxon-Mann-Whitney test in R clearly distinguishes the peptide profiles of patient body fluids from those of controls. Finally, the modular architecture of the application makes it possible to also handle data from FT-ICR experiments.
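The two comparisons quoted above can be checked with a few lines of Java. The sketch computes the per-test threshold p/n and, for comparison, the exact solution of equation (7), p' = 1 - (1 - p)^(1/n), which is sometimes called the Šidák correction. The class and method names are hypothetical.

```java
/** Sketch: per-test significance levels for n independent tests. */
final class MultipleTesting {

    /** Bonferroni threshold: overall level divided by the number of tests. */
    static double bonferroniThreshold(double overallP, int numberOfTests) {
        return overallP / numberOfTests;
    }

    /** Exact solution of (1 - p')^n = 1 - p, computed in a numerically stable way. */
    static double exactThreshold(double overallP, int numberOfTests) {
        return -Math.expm1(Math.log1p(-overallP) / numberOfTests);
    }

    /** True when a single test passes the Bonferroni threshold. */
    static boolean isSignificant(double pValue, double overallP, int numberOfTests) {
        return pValue < bonferroniThreshold(overallP, numberOfTests);
    }
}
```

For an overall level of 0.01 and 1949 masses, the Bonferroni threshold is about 5.13 × 10^-6, so a p-value of 1.1 × 10^-6 passes; for 1354 masses the threshold is about 7.39 × 10^-6, so 2.7 × 10^-7 passes as well. The exact threshold is slightly larger than p/n, which shows that the Bonferroni division is the more conservative of the two.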
Availability and requirements
Abbreviations
ACL: Access Control List
ASCII: American Standard Code for Information Interchange
CSV: Comma Separated Value
ERD: Entity Relationship Diagram
FFT: Fast Fourier Transformation
FID: Free Induction Decay
FT-ICR: Fourier Transform Ion Cyclotron Resonance
FTP: File Transfer Protocol
GUI: Graphical User Interface
JVM: Java Virtual Machine
JDBC: Java Database Connectivity
LIMS: Laboratory Information Management System
MALDI: Matrix Assisted Laser Desorption Ionization
mzXML: mass over charge eXtensible Markup Language
NTFS: Windows NT File System
SSH-2: Secure Shell version 2
SSL: Secure Sockets Layer
SQL: Structured Query Language
SWT: Standard Widget Toolkit
TCP: Transmission Control Protocol
The Netherlands Proteomics Centre (NPC), Virgo consortium, and research program Biorange of the Netherlands Genomics Initiative and the EU P-mark project financially supported this study. The authors thank Frank L. Morin of the Technische Hogeschool Rijswijk for his contribution to the Java programming and Eric Brouwer for technical assistance with installing the hardware.
- Dekker LJ, Boogerd W, Stockhammer G, Dalebout JC, Siccama I, Zheng P, Bonfrer JM, Verschuuren JJ, Jenster G, Verbeek MM, Luider TM, Sillevis Smitt PA: MALDI-TOF mass spectrometry analysis of cerebrospinal fluid tryptic peptide profiles to diagnose leptomeningeal metastases in patients with breast cancer. Mol Cell Proteomics 2005, 4(9):1341–1349. 10.1074/mcp.M500081-MCP200
- Dekker LJ, Dalebout JC, Siccama I, Jenster G, Sillevis Smitt PA, Luider TM: A new method to analyze matrix-assisted laser desorption/ionization time-of-flight peptide profiling mass spectra. Rapid Commun Mass Spectrom 2005, 19(7):865–870. 10.1002/rcm.1864
- Deitel H, Deitel P: Java how to program. 6th edition. New Jersey: Pearson – Prentice Hall; 2005.
- Zschunke M, Nieselt K, Dietzsch J: Connecting R to Mayday, Chapter 2: Calling R from within Java. Studienarbeit Bioinformatik 2004, 7–12. [http://www.zbit.uni-tuebingen.de]
- Lemkin PF, Thornwall G, Alvord WG, Lubomirski M, Sundaram S: Extending MicroArray Explorer with R Language Scripts. Frederick bioinformatics forum 2003. [http://maexplorer.sourceforge.net]
- Westman A, Nilsson CL, Ekman R: Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry analysis of proteins in human cerebrospinal fluid. Rapid Commun Mass Spectrom 1998, 12(16):1092–1098. 10.1002/(SICI)1097-0231(19980831)12:16<1092::AID-RCM286>3.0.CO;2-N
- Zhang J, Goodlett DR, Peskind ER, Quinn JF, Zhou Y, Wang Q, Pan C, Yi E, Eng J, Aebersold RH, Montine TJ: Quantitative proteomic analysis of age-related changes in human cerebrospinal fluid. Neurobiol Aging 2005, 26(2):207–227. 10.1016/j.neurobiolaging.2004.03.012
- Mann HB, Whitney DR: On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat 1947, 18: 50–60.
- Siegel S, Castellan NJ: Nonparametric statistics for the behavioral sciences. 2nd edition. New York: McGraw-Hill Book Co; 1988.
- Wilcoxon F: Individual comparisons by ranking methods. Biometrics Bull 1945, 1: 80–83. 10.2307/3001968
- Wong JWH, Cagney G, Cartwright HM: SpecAlign – processing and alignment of mass spectra datasets. Bioinformatics 2005, 21: 2088–2090. 10.1093/bioinformatics/bti300
- Jeffries N: Algorithms for alignment of mass spectrometry proteomic data. Bioinformatics 2005, 21: 3066–3073. 10.1093/bioinformatics/bti482
- Lin SM, Haney RP, Campa MJ, Fitzgerald MC, Patz EF Jr: Characterizing phase variations in MALDI-TOF data and correcting them by peak alignment. Cancer Informatics 2005, 1(1):38–100.
- Ramsay JO, Li X: Curve registration. J Roy Stat Soc, Ser B 1998, 60(2):351–363. 10.1111/1467-9868.00129
- Press WH, Flannery BP, Teukolsky SA, Vetterling WT: Numerical recipes in C: the art of scientific computing, Chapter 12: Fast Fourier Transform. 2nd edition. Cambridge: Cambridge University Press; 1992.
- Gentzel M, Köcher T, Ponnusamy S, Wilm M: Preprocessing of tandem mass spectrometric data to support automatic protein identification. Proteomics 2003, 3(8):1597–1610. 10.1002/pmic.200300486
- Horn DM, Zubarev RA, McLafferty FW: Automated reduction and interpretation of high resolution electrospray mass spectra of large molecules. J Am Soc Mass Spectrom 2000, 11(4):320–332. 10.1016/S1044-0305(99)00157-9
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.