LC-MSsim – a simulation software for liquid chromatography mass spectrometry data

Background: Mass Spectrometry coupled to Liquid Chromatography (LC-MS) is commonly used to analyze the protein content of biological samples in large scale studies. The data resulting from an LC-MS experiment is huge, highly complex and noisy. Accordingly, it has sparked new developments in Bioinformatics, especially in the fields of algorithm development, statistics and software engineering. In a quantitative label-free mass spectrometry experiment, crucial steps are the detection of peptide features in the mass spectra and the alignment of samples by correcting for shifts in retention time. At the moment, it is difficult to compare the plethora of algorithms for these tasks. So far, curated benchmark data exists only for peptide identification algorithms but no data that represents a ground truth for the evaluation of feature detection, alignment and filtering algorithms.

Results: We present LC-MSsim, a simulation software for LC-ESI-MS experiments. It simulates ESI spectra on the MS level. It reads a list of proteins from a FASTA file and digests the protein mixture using a user-defined enzyme. The software creates an LC-MS data set using a predictor for the retention time of the peptides and a model for peak shapes and elution profiles of the mass spectral peaks. Our software also offers the possibility to add contaminants, to change the background noise level and includes a model for the detectability of peptides in mass spectra. After the simulation, LC-MSsim writes the simulated data to mzData, a public XML format. The software also stores the positions (monoisotopic m/z and retention time) and ion counts of the simulated ions in separate files.

Conclusion: LC-MSsim generates simulated LC-MS data sets and incorporates models for peak shapes and contaminations. Algorithm developers can match the results of feature detection and alignment algorithms against the simulated ion lists and meaningful error rates can be computed. We anticipate that LC-MSsim will be useful to the wider community to perform benchmark studies and comparisons between computational tools.
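
As an illustration of how detected features might be matched against a simulated ion list, the following is a minimal Python sketch and not part of LC-MSsim. The `Feature` type, the function name `match_features`, the greedy matching strategy and the tolerance values are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    mz: float  # monoisotopic m/z
    rt: float  # retention time (seconds, an assumption for this sketch)


def match_features(detected, truth, mz_tol=0.05, rt_tol=10.0):
    """Greedily match each ground-truth ion to at most one detected feature.

    The tolerances are hypothetical defaults, not values from the paper.
    """
    unused = list(detected)
    true_pos = 0
    for ion in truth:
        candidates = [
            f for f in unused
            if abs(f.mz - ion.mz) <= mz_tol and abs(f.rt - ion.rt) <= rt_tol
        ]
        if candidates:
            # Take the candidate closest in m/z and mark it as used.
            best = min(candidates, key=lambda f: abs(f.mz - ion.mz))
            unused.remove(best)
            true_pos += 1
    return {
        "true_positives": true_pos,
        "false_positives": len(unused),
        "false_negatives": len(truth) - true_pos,
        "precision": true_pos / len(detected) if detected else 0.0,
        "recall": true_pos / len(truth) if truth else 0.0,
    }


truth = [Feature(500.25, 1200.0), Feature(622.31, 1500.0)]
detected = [Feature(500.26, 1203.0), Feature(800.00, 300.0)]
print(match_features(detected, truth))
```

A greedy one-to-one assignment keeps the sketch short; an optimal assignment could give slightly different counts when several features fall within the tolerance window of the same ion.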

We chose the parameter settings for each tool as described below. Three tools (msInspect, Decon2LS and MZmine) still produced large numbers of false positives, as stated below. For each of these three tools, we therefore removed all peptide features with an intensity below the first quartile. The first quartile is the value such that 25% of all feature intensities are below it and the rest above.

msInspect: This tool does not have any parameters besides the m/z and rt ranges searched. Each isotope pattern is scored using the Kullback-Leibler (KL) distance between an averagine model and the true peak intensities. A KL distance of zero indicates a perfect match between averagine model and signal and thus a high-quality peptide feature. To find a suitable cutoff for this score, we executed msInspect on a complex LC-MS run, a digest of Halobacterium NRC-1 proteins recorded on an API QSTAR Pulsar I instrument (downloaded from the PeptideAtlas database, ref: Pae000245, data set # 25, http://www.peptideatlas.org/). We manually annotated this data set and chose 20 intense and well-resolved peptide signals to determine a suitable cutoff. We chose a cutoff of 0.8 for the KL distance, such that 80% of the annotated features were detected. Applying this filtering threshold to the whole data set reduced the number of detected features from 3366 to 2608.
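
The two filtering ideas in this paragraph can be sketched in Python as follows. This is not msInspect's implementation: the function names, the direction of the KL distance (observed relative to model), the `eps` guard and the quartile interpolation used by `statistics.quantiles` are assumptions made for illustration.

```python
import math
from statistics import quantiles


def kl_distance(observed, model, eps=1e-12):
    """KL distance between normalized observed and model isotope intensities.

    Zero means the observed pattern has exactly the model's shape.
    """
    if len(observed) != len(model):
        raise ValueError("patterns must have the same number of peaks")
    obs_sum, model_sum = sum(observed), sum(model)
    if obs_sum <= 0 or model_sum <= 0:
        raise ValueError("patterns must have positive total intensity")
    total = 0.0
    for o, m in zip(observed, model):
        p = o / obs_sum
        q = max(m / model_sum, eps)  # guard against division by zero
        if p > 0:
            total += p * math.log(p / q)
    return total


def drop_below_first_quartile(intensities):
    """Keep only intensities at or above the first quartile."""
    q1 = quantiles(intensities, n=4)[0]
    return [x for x in intensities if x >= q1]


# Same shape at different scale gives a distance of (numerically) zero.
print(kl_distance([10, 20, 5], [2, 4, 1]))
# A distorted pattern gives a positive distance.
print(kl_distance([1, 1], [3, 1]))
print(drop_below_first_quartile([1, 2, 3, 4, 5, 6, 7, 8]))
```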
Decon2LS: This tool offers many parameters, only some of which are documented. According to the authors (personal communication), the fit intensity threshold and the fit score threshold should have a significant influence on the result. However, we found that the fit intensity threshold has little influence on the result. Consequently, we optimized the fit score threshold (a distance measure between 0 and 1, where 0 is best) in the same way as the KL distance described above.
We chose a cutoff of 0.2 for the fit score, which recovered 85% of the annotated features. The overall number of features was reduced from 19603 to 14592.
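
One way to picture this kind of cutoff selection is the Python sketch below. It assumes a score where lower is better and picks the smallest cutoff that recovers a target fraction of the annotated features; the function names and the example scores are hypothetical, and the cutoffs reported here were chosen by the authors, not by this code.

```python
import math


def cutoff_for_recovery(annotated_scores, target_fraction):
    """Smallest cutoff such that at least target_fraction of the
    annotated features have a score at or below it (lower is better)."""
    if not annotated_scores:
        raise ValueError("need at least one annotated score")
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    ranked = sorted(annotated_scores)
    # Small tolerance so that e.g. 0.8 * 5 is not rounded up to 5.
    needed = max(1, math.ceil(target_fraction * len(ranked) - 1e-9))
    return ranked[needed - 1]


def recovered_fraction(annotated_scores, cutoff):
    """Fraction of annotated features that pass the cutoff."""
    passed = sum(1 for s in annotated_scores if s <= cutoff)
    return passed / len(annotated_scores)


scores = [0.9, 0.1, 0.4, 0.2, 0.3]  # made-up scores for five annotated features
cutoff = cutoff_for_recovery(scores, 0.8)
print(cutoff, recovered_fraction(scores, cutoff))
```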
MZmine: This software offers different peak detection strategies. We used the "Recursive threshold peak detector" algorithm. Furthermore, the software has several parameters that influence the feature detection process. This process consists of two parts: a peak detection step and a de-isotoping step. Peak detection is influenced by parameters such as bin sizes, minimum intensity etc. The de-isotoping step groups peaks into isotope patterns, estimates a charge and removes incomplete isotope patterns and single peaks. We contacted the developers of MZmine and asked them which parameters would have a significant influence. MZmine does not compute a score or s/n threshold that could be used as a single filter criterion. Following the recommendations of the MZmine developers, we adjusted the "Chromatographic threshold level" and the "m/z bin size" as well as the noise threshold and the minimum peak height, which are both given in absolute intensity units.
For the high-resolution data (FWHM 0.02), we chose a set of parameters that recovered 100% of the annotated features in the QSTAR data set. With decreasing mass resolution, we relaxed the bin width and the chromatographic threshold level, but without satisfactory results. In each case, we estimated the noise level as the median intensity on a small m/z interval in an empty region (i.e. without peptide signals) of the LC-MS map. We set the minimum peak intensity to the same value.
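
The noise estimate described above could look roughly like the following Python sketch. The representation of peaks as `(m/z, intensity)` pairs, the function name and the example values are assumptions for this illustration; choosing the empty m/z interval remains a manual step.

```python
from statistics import median


def estimate_noise_level(peaks, mz_lo, mz_hi):
    """Median intensity of all peaks whose m/z lies in [mz_lo, mz_hi].

    The interval should cover a region without peptide signals.
    """
    intensities = [inten for mz, inten in peaks if mz_lo <= mz <= mz_hi]
    if not intensities:
        raise ValueError("no peaks in the chosen m/z interval")
    return median(intensities)


# Made-up peaks: three noise peaks near m/z 400 and one peptide peak.
peaks = [(400.1, 12.0), (400.5, 8.0), (400.9, 10.0), (512.3, 950.0)]
noise = estimate_noise_level(peaks, 400.0, 401.0)
min_peak_intensity = noise  # the text sets both thresholds to the same value
print(noise, min_peak_intensity)
```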

SpecArray: No parameter changes are possible, so no manual tuning was performed.
OpenMS + SuperHirn: Several parameters are offered. We performed a moderate optimization to achieve a trade-off between false positives and false negatives.

Availability of software and simulation parameters:
All peptide feature detection algorithms were tested in the version that was available online in January 2008.
In the case of OpenMS, we used a slightly modified version of the 1.0 release. It can be downloaded from SourceForge using:
svn co https://open-ms.svn.sourceforge.net/svnroot/open-ms/FF10
The installation instructions are the same as for the most recent OpenMS release version (1.2) and can be obtained from www.openms.de (=> follow link "installation").
The parameter settings and peptide feature lists for each feature detection algorithm, as far as they could be stored, are available in the supplement.
The simulated data sets are several GB in size and can be downloaded from the PRIDE database (http://www.ebi.ac.uk/pride/, accession numbers 8161-8168 inclusive).