Peak intensity prediction in MALDITOF mass spectrometry: A machine learning study to support quantitative proteomics
 Wiebke Timm^{1, 4}Email author,
 Alexandra Scherbart^{1},
 Sebastian Böcker^{2},
 Oliver Kohlbacher^{3} and
 Tim W Nattkemper^{1}
DOI: 10.1186/147121059443
© Timm et al; licensee BioMed Central Ltd. 2008
Received: 31 May 2008
Accepted: 20 October 2008
Published: 20 October 2008
Abstract
Background
Mass spectrometry is a key technique in proteomics and can be used to analyze complex samples quickly. One key problem with the mass spectrometric analysis of peptides and proteins, however, is the fact that absolute quantification is severely hampered by the unclear relationship between the observed peak intensity and the peptide concentration in the sample. While there are numerous approaches to circumvent this problem experimentally (e.g. labeling techniques), reliable prediction of the peak intensities from peptide sequences could provide a peptidespecific correction factor. Thus, it would be a valuable tool towards labelfree absolute quantification.
Results
In this work we present machine learning techniques for peak intensity prediction for MALDI mass spectra. Features encoding the peptides' physicochemical properties as well as stringbased features were extracted. A feature subset was obtained from multiple forward feature selections on the extracted features. Based on these features, two advanced machine learning methods (support vector regression and local linear maps) are shown to yield good results for this problem (Pearson correlation of 0.68 in a tenfold cross validation).
Conclusion
The techniques presented here are a useful first step going beyond the binary prediction of proteotypic peptides towards a more quantitative prediction of peak intensities. These predictions in turn will turn out to be beneficial for mass spectrometrybased quantitative proteomics.
Background
Today, mass spectrometry (MS) is an indispensable technique for the analysis of proteins and peptides in the life sciences. Various approaches have been developed to allow the comparison of protein abundances in cells between different environmental states. A growing number of studies in proteomics aim to quantitatively characterize proteomes for a better understanding of cellular mechanisms. These studies use either isotopic labeling or labelfree methods for protein quantification [1].
Using isotopic labeling methods, protein mixtures are tagged with a stable isotope that can be used to tell samples apart by their mass shift and to directly compare peaks from different samples. With SILAC (Stable Isotope Labeling with Amino acids in Cell culture) [2], a metabolic labeling method, labels are introduced during cell growth and division. Chemical labeling methods such as ICAT (Isotope Coded Affinity Tags) [3] and iTRAQ (Isotope Tags for Relative and Absolute Quantification) [4] introduce the label into peptides after proteolytic digestion. These methods allow accurate quantification relative to the tagged sample at the expense of additional costly and timeconsuming experimental processing steps. Enzymatic labeling with ^{18}O [5] during or after proteolytic digestion is another technique that avoids the complications that chemical labeling may cause but can be applied if metabolic labeling is not possible. However, labeling efficiency differs between peptides, which causes difficulties when comparing abundances between different proteins.
In contrast, labelfree methods directly use the signal intensities or spectral counts to estimate peptide abundances. But peak intensities also depend on peptide ionization efficiencies, which are influenced by a peptide's composition and the chemical environment. In other words, the sensitivity of a mass spectrometer varies between peptides. Therefore, two peptides with identical abundance will generally lead to different peak intensities. In a recent review of labelfree LCMS, America and Cordewener [6] state that "Normalisation of peptide abundance data is probably the most essential for improvement of the quantitative accuracy of the experiment." Absolute quantification with very high accuracy using labelfree methods is possible through the use of reference peptides [7], an example being AQUA (Absolute Quantification of Proteins) [8]. But again, such methods require significant experimental effort.
Consequently, labelfree techniques are routinely used only for differential quantification, that is, the determination of concentration ratios between samples.
Nonetheless, labelfree methods have several intrinsic advantages over labeling techniques. Obviously, they do not require labor and costintensive labeling. Also, there is no fundamental limit to the number of samples that can be compared. Unlike labeling techniques, labelfree methods do not increase the mass spectral complexity. They have the potential to analyze a higher range of protein concentrations and to achieve a higher proteome coverage.
There exist two fundamentally different experimental setups for labelfree quantification using MS: in both cases, proteins are digested and peptides are separated using Liquid Chromatography (LC). In the first case, the LC is directly coupled to an ElectroSpray Ionization (ESI) mass spectrometer, which allows a simple experimental setup and a rapid analysis of separated peptides. In the second case, LC fractions are spotted and analyzed using MatrixAssisted Laser Desorption Ionization (MALDI) MS [9–11]. Using MALDI MS has certain advantages such as a more efficient datadependent analysis: because the sample portions from the LC can be stored for several days and reanalyzed when necessary, it is possible to acquire fragmentation ion spectra for all MS parent ions that are of interest. Spectra are easier to interpret and compare because mostly singly charged ions are produced and detected.
If an estimate of the peptidespecific sensitivity of the mass spectrometer were possible, this would allow the use of labelfree techniques for absolute quantification. For this, different machine learning techniques have been proposed. Lu et al. [12] estimate the detectability of peptides for the APEX method, i.e. the probability of a peptide being observed in a spectrum at all. The authors evaluate different classifiers for this purpose, and find that bagging with a forest of random decision trees produces the best results. The predicted values are then used to enhance the spectral countsbased, uncorrected abundance estimation by about 30%. Tang et al. [13] use twolayer neural networks to classify peptides into detectable and undetectable. They derive a Minimum Acceptable Detectability for Identified Proteins (MDIP), a cutoff value that maximizes the sum of true positives and true negatives. The MDIP is shown to increase as protein abundance decreases, which, according to the authors, could be utilized to improve quantification. For both studies, data is acquired using LC coupled to ESI. Mallick et al. [14] introduce the term proteotypic for peptides that are observed in more than 50% of the spectra they are expected in. The authors classify peptides from four different MS setups (PAGEMALDI, PAGEESI, MUDPITESI and MUDPITICAT) into proteotypic and nonproteotypic peptides using a Gaussian mixture model, and achieve a cumulative accuracy of up to 90%.
Both Lu et al. [12] and Tang et al. [13] use predicted detection probabilities to correct peptide abundances. In both cases, the authors utilize detection probabilities as if these probabilities really were peak intensities. But for peak intensity correction, it seems much more appropriate to predict peak intensities directly. Gay et al. [15] classify peptides into observed and unobserved ones and, in addition, directly predict peak intensities via regression. Unfortunately, this study is flawed by the fact that the same peptides were used in the training and test sets, to make up for the small size of the dataset.
For our initial evaluation, we do not use MS measurements of LCseparated peptides; instead, proteins are separated via 2D polyacrylamide gel electrophoresis (2DPAGE). Separated probes usually contain only a single protein that is subsequently digested and analyzed by MALDI Time of Flight (TOF) mass spectrometry. In this experimental setup, given a correct preparation and only one protein in the digested gel spot, all peptides in a spectrum have exactly the same abundance. Therefore, the peptidespecific sensitivity of the mass spectrometer can be accessed by comparing peak intensities in every such spectrum. A peptidespecific correction factor can then be calculated by dividing one over the corresponding peak intensity. It is understood that one cannot experimentally determine the peptidespecific sensitivity for all possible peptides.
In this study, we investigate if peak intensities of peptides in MALDITOF MS spectra can be predicted with machine learning (ML) techniques. We do not predict the probability of a certain peptide to be detectable but instead predict its normalized peak intensity directly. This is an important step to facilitate the enhancement of labelfree quantification accuracy. We reach a prediction accuracy of r = 0.66 for acrossdataset prediction, where r is the Pearson correlation between predicted and observed intensities. Our datasets stem from 2DPAGE separated proteins, but clearly, our results can directly be applied to enhance the quantification accuracy of MALDIbased LCMS experiments [9–11]. A study by Mallick et al. [14] shows that a slightly simpler problem, the prediction of proteotypic peptides, can be done successfully for LCMS experiments. We are therefore hopeful that regression approaches like the one proposed here may be successful for LCMS data, too.
Let s be the sequence of a peptide we have identified, and let I be the intensity of the MS peak corresponding to this peptide. Instead of directly using I as an estimate for the relative peptide concentration, we apply ML to compute a predicted intensity pI(s) of the peptide solely from the peptide sequence. We can now calculate a corrected peptide intensity I' = 1/pI(s)·I that replaces I in subsequent steps of the analysis. Here, 1/pI(s) is the peptidespecific correction factor. If the peak intensity estimate pI(s) is accurate then I' is a better estimate than I of the relative peptide concentration.
Methods
We use two sets of MALDITOF mass spectra to generate the datasets for this study. These were measured on a Bruker Ultraflex instrument (Bruker Daltonics, Bremen, Germany) using proteins extracted from Corynebacterium glutamicum [16]. The proteins were separated by 2D gel electrophoresis, then digested into peptide fragments with trypsin prior to MS analysis. The corresponding peptide sequences were derived from protein identification using MASCOT peptide mass fingerprinting [17] and an inhouse database containing C. glutamicum protein sequences. For the smaller dataset A, both the selection of spectra and determination of search parameters for the MASCOT identification were done manually, while for dataset B, both were done automatedly.
For dataset A, spectra were manually selected from a previously unpublished set of spectra. Search parameters were chosen manually by an expert and included fixed (carbamidomethylation of cysteine) and variable modifications (oxidation of methionine), and no missed cleavages were allowed. The list of spectra was filtered for the 20% of spectra with the best MASCOT score and the largest difference from the second best hit, resulting in 62 spectra being used for further analysis. Of 27 identified proteins, 15 were present in multiple spectra. For dataset B, spectra were run through fully automated MASCOT peptide mass fingerprinting search with 42 different sets of search parameters. These included with and without oxidation of methionine, tolerance within {50, 100, 150, 200, 250, 500, 750} ppm, and up to {0, 1, 2} missed cleavages allowed. The resulting list was filtered automatically to fulfill the following properties: a) protein mass in range [8000, 12000] Da, b) pI between 4 and 7 because of the 2D gel used, c) highest MASCOT protein hit score above 65, and d) sequence coverage above 15% using the search parameters that produced the highest score. Spectra with more than one highscoring hit were also removed. Application of this protocol left 182 spectra for further analysis. Of 125 identified proteins, 35 were present in more than one spectrum. To summarize the differences, A can be considered a small, carefully chosen dataset while B is larger and of lesser overall quality. An overview table can be found in additional file 1: datasets and a histogram showing the number of spectra per protein in additional file 2: spectranumbers. The additional files 3 and 4: dataset_a_protein_list and dataset_b_protein_list show lists of the identified proteins. We give a short outline of our spectra preprocessing pipeline; details are deferred to additional file 5: preprocessing). After denoising with a SavitzkyGolay filter [18], baseline correction, and removal of noise peaks, peak intensities are extracted from the spectra. We unfold isotopic distributions by adding up all peak intensities of isotope peaks that belong to the same peptide. The resulting list of peaks is matched against masses calculated from a theoretical tryptic digestion. The matching of sequences to peaks is straightforward in this case because the peptide sequences are known. We ignore missed cleavages and variable modifications in the matching process. We allow for a mass error of up to 1 Da to make up for calibration errors. In case of multiple peaks inside the allowed mass error range, the one closest to the theoretical mass is chosen.
In dataset A, for 371 out of 535 expected peptides, we detected the corresponding peak in the mass spectrum. In dataset B, 1023 out of 1994 predicted peaks were detected. For most peptide sequences, only one peak intensity measurement exists: in dataset A, 49.2% of peptide sequences are unique, whereas in dataset B, this is true for 70.2% of the peptides.
Abundances differ between spectra. To use intensities from different spectra together in one dataset, we need to normalize them. Since the amount of protein in each spectrum is unknown, we introduce two normalizations for peak intensities. We compare the performance of both normalizations in our experimental evaluation.:

Normalization by corrected mean ion current (mic). The intensity of a peak p is scaled by the mean ion current, i.e. the mean of all baselinecorrected measurements C_{1},...,C_{ N }in the spectrum:${I}_{p}^{mic}=\frac{{I}_{p}}{\frac{1}{N}{\displaystyle {\sum}_{i=1}^{N}{C}_{i}}},$
where I_{ p }denotes the raw intensity of peak p after peak extraction. Here, index i runs over all raw values (i.e. not only peptide peaks) of the spectrum the peptide was found in. By doing so, we take into account intensities of unmatched peptides as well as differences in the overall sensitivity of the detector.

Normalization by sum of peptide peak intensities (sum). The intensity of a peak p is scaled by the sum of all matched peptide's peak intensities i = 1,...,P to yield${I}_{p}^{sum}=\frac{1000\cdot {I}_{p}}{{\displaystyle {\sum}_{i=1}^{P}{I}_{i}}},$
where I_{ i }denotes the intensity of the i^{ th }peptide peak after peak extraction. A similar approach has been used by Radulovic et al. [19]. The factor 1000 is used for numerical reasons.
Some peptides are present in more than one spectrum and, hence, these peptides show more than one peak intensity value. Most learning architectures do not cope well with different target values per input.
Therefore, we calculate an αtrimmed mean with α = 50% for peptides with more than three target values, and a mean for peptides with two or three target values. As we will see below, this also allows us to estimate the potential prediction accuracy. As a final preprocessing step, we logarithmize intensities with a natural logarithm such that the resulting error becomes additive, which stabilizes the variance [1, 20, 21], and the values themselves become approximately normal distributed (see QQ plots in additional file 6: qqplots).
We can now state the peak intensity prediction problem as a supervised learning problem. A training set Γ = {(x, y)_{ j }, j = 1,..., N} consists of inputoutput pairs (x, y) where x ∈ ℝ^{ t }is an input peptide feature vector, and y ∈ ℝ is the normalized intensity we want to predict (target value). Obviously, this is a (nonlinear) regression problem and a wide range of techniques can be applied.
When calculating target intensities for each peptide, we assume that all proteins were correctly identified, we assume perfect digestion, and we ignore variable modifications for all steps following identification. We are aware that this is not perfectly true in reality: There are a few missed cleavages, variable modifications were found during the database search, and we can not totally exclude the possibility of misidentification, even though MASCOT thresholds were chosen to practically exclude this case. In this sense, our datasets can be seen as "imperfect" and cleaner datasets could be used. But if intensity prediction is possible using this imperfect, partly erroneous and noisy data, results will only improve when cleaner datasets are available. To show that we in fact learn to predict intensities from peptide sequences, we shuffled the assignment of peptide sequences to target values and found that no prediction is possible for the shuffled dataset; see below.
The same arguments hold for the effect of ion suppression [22]: the signal of a compound is suppressed in the presence of other compounds that compete for ionization. Intensity prediction could take this effect into account if each peptide occurred in multiple spectra with all possible combinations of other peptides, and if there was no contamination. However, such a dataset is impossible to acquire. Knowing this, we neglect the fact that peptide peak intensities depend not only on the peptide's constitution, but also on the combination of other peptides present. We consider this effect an additional noise component our method has to cope with. In any application of our method, this effect would always be unknown, and we aim at a realistic estimation of the performance of our method.
Machine learning methods
We selected two complementary nonlinear regression architectures. νSupport Vector Regression (νSVR) [23] has excellent generalization abilities and copes well with highdimensional input data [24]. However, it is difficult to interpret its models, and parametertuning can be timeconsuming. The second learning architecture, the Local Linear Map (LLM) [25] is less accurate if applied to higherdimensional feature spaces but is very efficient (i.e. fast training and adaptation to addition data, and low memoryusage). It is more transparent and can be used for peptide prototyping as shown in [24]. Both architectures represent different principles of learning nonlinear regression functions. There exist cases where a linear model outperforms more sophisticated regression models. Therefore, we apply a simple linear model (LM) [26] for comparison. Other methods have been tested and perform similarly to or worse than those presented here (data not shown).
νSupport Vector Regression (νSVR)
Support vector machines are a class of learning algorithms that are designed to implicitly transfer input feature vectors into a highdimensional feature space where classes are linearly separable and the optimal linear decision boundary can be calculated. In practice, however, it depends on the choice of a kernel function and the data whether linear separability is actually achieved. For support vector regression, the εinsensitive loss function y  f(x)_{ ε }= max {0, y  f(x)  ε} was introduced by Vapnik et al. [27]. Here, errors are only considered if they are higher than ε for some fixed ε > 0. Since the choice of an appropriate ε can be difficult, the νSVR introduced by Schölkopf et al. [23] finds the best ε automatically by minimizing a cost function. Instead, ν, an upper bound of the number of allowed errors and a lower bound to the number of support vectors, has to be chosen a priori. The νSVR generalizes an estimator for the mean of a random variable discarding the largest and smallest examples (a fraction of at most $\frac{\nu}{2}$ of either category), and estimates the mean by taking the average of the two extremal values of the remaining examples. This results in good robustness of the νSVR. Other parameters that have to be chosen are the regularization parameter C and kernel width γ. Parameter C controls the tradeoff between the weight of errors and the complexity of the regression function. Parameter γ controls the width of the radial basis function $K(x,{x}^{\prime})={e}^{\gamma \left\rightx{x}^{\prime}{}^{2}}$, which we use as kernel function. We use the libsvm implementation of the e1071 package available for R [28, 29].
Local Linear Map (LLM)
Local Linear Maps [25], a type of artificial neural net, combines an unsupervised vector quantization algorithm based on SelfOrganizing Maps (SOM) [30] with supervised linear learning principles for prediction of peak intensities. The LLM can learn global nonlinear regression functions by fitting a set of local linear functions to the training data. It has been successfully used for peptide prototyping [31] and provides a basis for peptide feature profiling and visualization.
Motivated by the SOM, an LLM consists of a set of n_{ l }regular ordered nodes v_{ i }, i = 1,...,n_{ l }, which are connected to each other via a twodimensional grid structure, defining a neighborhood between the nodes and a topology in feature space. Each node consists of a triple ${v}_{i}=({w}_{i}^{\text{in}},{w}_{i}^{\text{out}},{A}_{i})$. The vectors ${w}_{i}^{\text{in}}\in {\mathbb{R}}^{{d}_{\text{in}}}$ are used to build prototype vectors adapting to the statistical properties of the input data ${x}_{\xi}\in {\mathbb{R}}^{{d}_{\text{in}}}$. The vectors ${w}_{i}^{\text{out}}\in {\mathbb{R}}^{{d}_{\text{out}}}$ approximate the distribution of the target values ${y}_{\xi}\in {\mathbb{R}}^{{d}_{\text{out}}}$. The matrices ${A}_{i}\in {\mathbb{R}}^{{d}_{\text{in}}\times {d}_{\text{out}}}$ are locally trained linear maps from the input to the output space.
The learning step widths ϵ, ϵ^{ A }, ϵ^{ out }∈ [0; 1] for updating neighbors and s are decreased during training. After adapting the prototypes, a classification can be applied by assigning every input vector x to its closest prototype.
The concept of approximating nonlinear functions by fitting simple models to localized subsets of the data is related to other regression approaches like LocallyWeighted Regression (LOESS) [32] and to radial basis functions (RBF) [33]. Hastie et al. demonstrated the usefulness of locally linear function fitting as well [34]. We use here our own implementation of LLM as an R package.
Linear model (LM)
A linear regression model assumes the data to have the structure y = X^{ T }b, which corresponds to data lying on a straight line. Here, X is the input data in matrix formulation, y the vector of target values, and b a vector of coefficients that have to be found when adapting the model to the data. The ordinary least squares (OLS) algorithm is applied to find the coefficients. OLS minimizes the squared differences (errors) between the model's output and the target values of the training examples.
Feature extraction
We cannot directly use peptide sequences as input for the learning architectures, and derive numerical feature vectors x_{ j }to represent molecular features of peptide j. In bioinformatics, peptides are usually represented as strings over the alphabet of amino acid characters. However, a biochemist is more interested in the chemical properties of a peptide to characterize it. These paradigms motivate different feature sets:

A 20dimensional feature set with only single amino acid counts (mono).

A purely sequencebased 9220dimensional sequence feature set (seq). Each peptide is mapped to the 9220dimensional vector by counting how often a certain kmer appears in the sequence. Each feature vector contains 20 counts for all single amino acids (like mono), 400 counts for all dipeptides, 8000 counts for all tripeptides, and two times 400 counts for terminal dipeptides at the beginning and end of the peptide sequence. Additional file 7 (substringfrequency) shows the frequency distribution of di and tripeptide strings in the used datasets. Here, "dipeptides" or "tripeptides" refer to substrings of the peptide sequences, not to single molecules consisting of two or three amino acids. We will show below that the pure sequencebased feature set is often not sufficient for a decent prediction of peak intensities, what motivates the use of the next feature set.

A 531dimensional chemical feature set (aa) computed from amino acid attributes. Attributes are taken from the amino acid index database [35]. Each amino acid index AA = (AA_{1},...,AA_{20}) consists of twenty real values for the twenty amino acids. Let m(s) be the number of occurrences of the amino acid s in the amino acid index. For a peptide S = s_{1}...s_{ n }, the value for the corresponding feature f is calculated as the sum of attribute values, $f={\displaystyle {\sum}_{k=1}^{n}A{A}_{m({s}_{k})}}$. This value reflects the overall property of the peptide. There are 516 attributes in the amino acid index database, therefore we can calculate 516 such features. In addition, we use features for peptide length, mass, and numbers and fractions of acidic, basic, polar, aliphatic, and arginine residues. Finally, three features for gasphase basicity are added to the feature vector: a) The estimated gasphase basicity is calculated as proposed by Zhang et al. [36] as well as b) a sum over the residual values of the amino acids that were used for this estimation, and c) that sum scaled with the length of the peptide.
In the course of our analyses, we also evaluate what features are particularly important for the task of predicting peak intensities. This leads us to the following feature set:

A reduced feature set resulting from forward stepwise selection on aa and seq. This selected subset feature set (sss) is 18dimensional; its features can be found in Table 1. The following section explains how they were chosen.
Features constituting the sss feature set
feature ID  explanation  selected 

GB500  Estimated gasphase basicity at 500 K (Zhang et al., 2004)  20 
VASM830103  Relative population of conformational state E (Vasquez et al., 1983)  11 
NADH010106  Hydropathy scale (36% accessibility) (NaderiManesh et al., 2001)  9 
FAUJ880111  Positive charge (Fauchere et al., 1988)  6 
WILM950102  Hydrophobicity coefficient in RPHPLC, C8 with 0.1%TFA/MeCN/H_{2}O (Wilce et al. 1995)  6 
OOBM850104  Optimized average nonbonded energy per atom (Oobatake et al., 1985)  2 
mass  Molecular mass of the peptide   
KHAG800101  The Kerrconstant increments (KhanarianMoore, 1980)   
NADH010107  Hydropathy scale (50% accessibility) (NaderiManesh et al., 2001)   
ROBB760107  Information measure for extended without Hbond (RobsonSuzuki, 1976)   
FINA770101  Helixcoil equilibrium constant (FinkelsteinPtitsyn, 1977)   
ARGP820102  Signal sequence helical potential (Argos et al., 1982)   
R  No. of arginine residues  20 
F  No. of phenylalanine residues  20 
M  No. of methionine residues  17 
Q  No. of glutamine residues  5 
Y  No. of tyrosine residues  4 
H  No. of histidine residues   
All features are centered and normalized by variance prior to training. Datasets and feature vectors are available from ftp://ftp.cebitec.unibielefeld.de/pub/blind/PeakIntensityPrediction_Data.tgz. Raw data and identifications are available from ftp://ftp.cebitec.unibielefeld.de/pub/blind/PeakIntensityPrediction_RawData.tgz.
Feature selection
Often, accuracy, speed, and interpretability can be increased by reducing the number of features. To select a few features out of hundreds, we apply a simple greedy selection method as follows: a forward stepwise selection as described in [37] (section 3.4) was applied twenty times to the aa and seq feature set of dataset A (mic). This method starts with the intercept and calculates a value ${F}_{i}=\frac{e{e}_{+}}{{e}_{+}/(Nk2)}$ for each feature i, where e is the prediction error of a 10fold crossvalidation with the νSVR on the current model and e_{+} the prediction error of the model with the additional feature i. N denotes the number of training examples, and k the number of features in the smaller model. The feature that gives the highest value F_{ i }is added to the model before the next iteration. The procedure is repeated until no feature produces an F_{ i }that is higher than the 95th percentile of the F_{1,Nk2}distribution. For the νSVR, we used the parameters chosen by the grid search on the full feature set, since it is infeasible to repeat a grid search for each selection step. The resulting feature set depends on random partitioning during 10fold crossvalidation. Thus, each application of the selection algorithm can produce a different feature set. We selected those features that were selected most often (≥ 5 out of 20), reviewed the chosen features, and added others that might also be important for peptidespecific sensitivity of a mass spectrometer.
Evaluation techniques
We determine the best parameter set for each regression model using 10fold crossvalidation. We make sure that the ten sets contain disjunct peptide sequences. A grid search over the parameter space is performed to determine optimal parameters. The 10fold crossvalidation is applied for each parameter set. The best parameter set is the one with highest mean Pearson correlation coefficient (r) between target and predicted value over all ten test sets. The mean squared error (MSE) has also been calculated. However, it is not comparable between datasets without additional normalization, whereas the Pearson correlation coefficient is independent of the scale used. In addition, in this case, the parameter set with the lowest MSE is always the same as the one with the highest r, or very close to it.
For νSVR, the grid search runs over C ∈ {e^{3}, e^{1},...,e^{13}} and γ ∈ {e^{13}, e^{11},...,e^{5}} in steps of e^{2}, and ν ∈ {0.2, 0.3,...,0.8} in steps of 0.1. The parameters sampled by the LLM are prototypes n ∈ {1, 2, 3, 6, 10, 15, 25}, width of the neighborhood function σ ∈ {5, 2, 1, 0.4}, linear vs. exponential decrease of learning step size, and number of learning iterations t ∈ {10, 50}.
The final model is built by retraining the whole dataset with the optimal parameter set determined in the model selection step. Since the parameters may be adapted to the data used for crossvalidation, we estimate the performance on new data by validating the prediction accuracy on the other dataset: The model trained on A is validated on B, and vice versa. Both datasets have 193 peptides in common, therefore we also validate across datasets with these 193 peptides excluded. The MSE of this acrossdataset validation can be reduced by normalizing by variance and centering the data. Pearson's correlation coefficient is not affected by such operations.
Results and discussion
Overview of Pearson's correlation coefficients using mic normalization
validation  dataset  feature set  νSVR  LLM  LM 

10fold CV  A  aa  0.66  0.52  0.60 
sss  0.66  0.67  0.51  
mono  0.68  0.64  0.52  
seq  0.57  0.34  0.34  
B  aa  0.53  0.46  0.49  
sss  0.55  0.53  0.49  
mono  0.47  0.53  0.48  
seq  0.44  0.27  0.41  
across datasets  A  aa  0.65  0.52  0.14 
sss  0.63  0.59  0.47  
mono  0.66  0.57  0.45  
seq  0.46  0.21  0.40  
B  aa  0.45  0.24  0.01  
sss  0.50  0.44  0.39  
mono  0.45  0.39  0.32  
seq  0.32  0.05  0.28  
A  aa  0.58  0.47  0.00  
sss  0.58  0.55  0.41  
mono  0.61  0.52  0.39  
across datasets  seq  0.37  0.21  0.22  
without duplicates  B  aa  0.44  0.42  0.00 
sss  0.53  0.46  0.40  
mono  0.46  0.44  0.32  
seq  0.32  0.00  0.03 
Best performance
Summary table for additional scatter plots
DS  normalization  νSVR  LLM  

10fold CV  A  sum  additional file 10: asumsvrcross  additional file 11: allmcross 
mic  additional file 12: amicsvrcross  additional file 11: allmcross  
B  sum  additional file 13: bsumsvrcross  additional file 14: bllmcross  
mic  additional file 15: bmicsvrcross  additional file 14: bllmcross  
across datasets  A  sum  additional file 16: asumsvrvalidation  additional file 17: allmvalidation 
mic  additional file 18: amicsvrvalidation  additional file 17: allmvalidation  
B  sum  additional file 19: bsumsvrvalidation  additional file 20: bllmvalidation  
mic  additional file 21: bmicsvrvalidation  additional file 20: bllmvalidation 
The scatter plots of target vs. predicted values in Fig. 2 are typical for dataset A. The crossvalidation plot shows a point cloud that is almost diagonal and shows considerable spread especially for low values. The acrossdataset prediction plot (right side of Fig. 2) shows that values of A are systematically predicted too high when the model trained and parametertuned on dataset B is used. An analysis of the statistical properties of the target values of both datasets reveals that B has a higher mean and its distribution is more skewed towards higher values. This difference can be explained by the fact that both datasets were wetlab processed by different persons, dataset B by multiple persons even, and they were analyzed with different settings of the mass spectrometer. In applications of our method, an additional normalization step should be applied that accounts for these differences.
Dependence on peak intensity
Dependence on the learning architecture
Of all the learning architectures νSVR gives the best results in all cases, and the LLM's performance is comparable in the crossvalidation and slightly worse in the generalization case (Table 2). Where data becomes available gradually during experiments, we suggest applying the LLM in early stages: it is faster to train than the νSVR and can be adapted with additional data without timeconsuming complete retraining. When enough data has been collected and results begin to stabilize, the νSVR should be used to obtain a final prediction model. The LM shows bad overfitting: It beats the LLM for the aa feature set in the crossvalidation, but shows absolutely no correlation when generalizing to new peptides, and only very little correlation for the lowerdimensional feature sets. This suggests a nonlinear relationship between feature vectors and the target values for any of the peptide representations.
Influence of features sets
A comparison of different feature sets shows that the mono and sss feature sets generally lead to similar prediction results, the sss feature set having a slight advantage in more cases than the mono set. Thus, our feature selection increases accuracy slightly for most cases. For the 531dimensional aa feature set, only the νSVR successfully extracts the relevant information to achieve comparable prediction accuracy. For the other learning algorithms, the high dimensionality with some features being highly correlated to each other, leads to bad generalization performance, i.e. inability to predict intensities for new peptides. The sparse 9220dimensional seq feature set performs worse than any other feature set. While, in principle, this feature set captures more information about amino acid order, it is inappropriate for our small training datasets. This is a general problem: there are indications that not only the amino acid frequencies but also their order determine peak intensities. However, the amount of information necessary to capture this relationship explodes. Even if the amino acid order is encoded only partially, as in this case, more training data is needed.
Analysis of the selected features
The features selected most often in the feature selection on dataset A were the estimated gasphase basicity at 500 K (GB500), the absolute number of arginine residues (arginin_count), the relative population of conformational state E (VASM830103 [38]), the hydropathy scale based on selfinformation values in the twostate model at 36% accessibility (NADH010106 [39]), the hydrophobicity coefficient in RPHPLC, C8 with 0.1%TFA/MeCN/H_{2}O (WILM950102 [40], and the number of positive charges (FAUJ880111 [41]) of the aa feature set. From the seq feature set, the numbers of arginine (R), phenylalanine (F), and methionine (M) residues were selected most often. Table 1 shows the exact numbers. Details of the results of the forward stepwise selection (selected features and performance values) can be found in additional file 9: stepwise selection.
Forward stepwise selection is a greedy method and does not find an optimal solution. None of the selected sets from each single run of the method leads to better performance than that of the original aa set. Thus, we add in other features that extend the description of the peptide.
In general, the automatically selected features are of higher importance than handpicked ones (Fig. 5). Of the handpicked features, the mass and Kerrconstant increments (KHAG800101 [44]) show the highest increase in error, i.e. are most important.
The high importance of VASM830103 indicates that the amino acids' conformation might play a role for peptidespecific instrument sensitivity. However, not much is known about the conformation of peptides after tryptic digestion, when crystallized within the matrix, or as ions in gasphase. Looking at the chemistry, it makes sense that the gasphase basicity influences ionization efficiency. The number of positive charges have been reported by Mallick et al. to be relevant for the probability of observing a peptide ion [14]. The latter has often been chosen in the feature selection but is the least important one in the sss feature set according to our feature importance accession. The number of histidine residues (H) has been found to be correlated with detection probabilities in MALDI MS by Mallick et al.. It is the only basic residue except for K and R, presumably making a difference, though weakly, in basicity for tryptic, i.e. already quite basic peptides. However, it is one of the three least important features according to the random forest method in our dataset. We can exclude the correlation (r = 0.922) between H and FAUJ880111 as the cause of their low importance: the importance ranking is the same if one of both is left out completely. The hydrophobicity features (WILM950102, NADH010106) are of intermediate importance. Since all except a few peptides in the studied datasets are tryptic, i.e. have lysine (K) or arginine (R) as the Cterminal amino acid, the R feature value almost always indicates whether the peptide string ends in R or not. In the latter case, it most certainly ends in K. The feature R is highly correlated with the GB500 value: the estimated gasphase basicity takes on values distributed around three levels, which correspond to peptides ending in R (highest), those ending in K, and those that have neither. Nonetheless, both features seem to be complementary according to our results.
Twosample ttest results
substring s  pvalue  μ(${s}_{A}^{}$)  μ(${s}_{A}^{+}$)  size(${s}_{A}^{}$)  size(${s}_{A}^{+}$) 

R  2.41e15  2.76  3.73  171  244 
K  2.85e14  3.72  2.78  243  172 
M  1.69e06  3.44  2.53  366  49 
H  2.25e05  3.20  3.90  338  77 
VKe  4.64e05  3.36  2.31  403  12 
VK  8.92e05  3.36  2.39  402  13 
VF  1.25e04  3.29  4.65  403  12 
Y  2.11e04  3.18  3.71  296  119 
F  2.99e04  3.15  3.67  272  143 
GF  5.83e04  3.29  4.48  400  15 
Q  8.97e04  3.16  3.61  260  155 
GKe  0.001  3.37  2.62  392  23 
TKe  0.001  3.36  2.45  401  14 
SV  0.001  3.28  4.15  393  22 
TK  0.003  3.36  2.53  400  15 
GK  0.009  3.37  2.76  390  25 
PR  0.009  3.30  4.51  403  12 
PRe  0.009  3.30  4.51  403  12 
DS  0.009  3.29  4.10  395  20 
substring s  pvalue  μ(${s}_{B}^{}$)  μ(${s}_{B}^{+}$)  size(${s}_{B}^{}$)  size(${s}_{B}^{+}$) 
R  4.12e33  3.69  4.64  339  795 
K  6.76e31  4.65  3.74  770  364 
M  3.72e25  4.52  3.41  966  168 
DK  3.80e07  4.38  3.35  1112  22 
DKe  1.18e06  4.37  3.38  1113  21 
GM  1.67e05  4.38  3.23  1112  22 
AKe  2.27e05  4.39  3.64  1085  49 
NKe  5.82e05  4.38  3.42  1111  23 
GRe  9.18e05  4.31  4.93  1054  80 
QRe  1.37e04  4.33  5.10  1100  34 
W  2.63e04  4.40  3.93  1034  100 
AK  2.75e04  4.39  3.73  1083  51 
NK  3.03e04  4.37  3.51  1110  24 
GR  6.16e04  4.32  4.86  1051  83 
QR  7.46e04  4.34  5.01  1098  36 
FRe  0.001  4.34  5.28  1111  23 
IK  0.001  4.37  3.47  1113  21 
IKe  0.001  4.37  3.47  1113  21 
GKe  0.001  4.37  3.74  1104  30 
GK  0.002  4.37  3.80  1101  33 
DT  0.003  4.38  3.80  1083  51 
AM  0.003  4.38  3.40  1108  26 
S  0.004  4.47  4.25  538  596 
P  0.006  4.25  4.46  579  555 
TKe  0.008  4.37  3.76  1110  24 
VRe  0.008  4.33  4.77  1069  65 
Dataset dependence
Generally, dataset A gives much better results than B, even though the latter is larger (see Table 2 as well as additional files of A and B scatter plots referred to in Table 3). An obvious reason is the higher withinpeptide variance of normalized intensities and the higher fraction of peptides without replicate measurements in dataset B. Nonetheless, the algorithms are able to draw the main trends from B, since A can be predicted with a model trained on B even better than B itself (compare "across datasets" for A to "10fold CV" for B in Table 2). This shows that despite the high variance in a dataset, the learning algorithms can capture the main trends, thus enabling prediction of a less noisy dataset.
Conclusion
The machine learning approach presented here is able to predict peptide peak intensities in MALDI mass spectra, thus showing that the prediction of sensitivity factors for mass spectrometry based on chemical and sequence features is feasible. Even for small datasets, significant correlations can be achieved.
Although we have not yet evaluated our method against uncorrected spectral counts and correction based on detection probabilities, we believe that our method is better suited for peptide quantification than the latter, since we directly aim at predicting the reciprocal of the peptidespecific instrument sensitivity. As the next step in our research, we want to assess how predicted intensities affect quantification accuracy. We will apply our method to protein mixtures of known concentrations to estimate how well we can predict protein concentrations at the semiquantitative level. Also, the application of the method to other ionization techniques, in particular data from electrospray ionization (LCESI), is an obvious next step. We are confident that we can improve prediction performance using larger datasets with more replicates. Regarding additional features to be included for ML, features comprising conformational aspects of peptides appear very promising. Due to the high computational costs, it will have to be assessed whether the estimation of such features is feasible for this application.
Declarations
Acknowledgements
AS funded by Deutsche Forschungsgemeinschaft (BO 1910/31), WT funded by International NRW Graduate School for Bioinformatics and Genome Research. The authors thank Andreas Wilke, Nicole Hansmeier, Christian Rückert, and Jörn Kalinowski for providing the spectra and MASCOT identification.
Authors’ Affiliations
References
 Bantscheff M, Schirle M, Sweetman G, Rick J, Kuster B: Quantitative mass spectrometry in proteomics: a critical review. Anal Bioanal Chem 2007.Google Scholar
 Ong SE, Blagoev B, Kratchmarova I, Kristensen DB, Steen H, Pandey A, Mann M: Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol Cell Proteomics 2002, 1(5):376–386. 10.1074/mcp.M200025MCP200View ArticlePubMedGoogle Scholar
 Gygi SP, Rist B, Gerber SA, Turecek F, Gelb MH, Aebersold R: Quantitative analysis of complex protein mixtures using isotopecoded affinity tags. Nat Biotechnol 1999, 17(10):994–999. 10.1038/13690View ArticlePubMedGoogle Scholar
 Ross PL, Huang YN, Marchese JN, Williamson B, Parker K, Hattan S, Khainovski N, Pillai S, Dey S, Daniels S, Purkayastha S, Juhasz P, Martin S, BartletJones M, He F, Jacobson A, Pappin DJ: Multiplexed protein quantitation in Saccharomyces cerevisiae using aminereactive isobaric tagging reagents. Mol Cell Proteomics 2004, 3(12):1154–1169. 10.1074/mcp.M400129MCP200View ArticlePubMedGoogle Scholar
 Yao X, Freas A, Ramirez J, Demirev PA, Fenselau C: Proteolytic 18O labeling for comparative proteomics: model studies with two serotypes of adenovirus. Anal Chem 2001, 73(13):2836–2842. 10.1021/ac001404cView ArticlePubMedGoogle Scholar
 America AHP, Cordewener JHG: Comparative LCMS: a landscape of peaks and valleys. Proteomics 2008, 8(4):731–749. 10.1002/pmic.200700694View ArticlePubMedGoogle Scholar
 Mayr BM, Kohlbacher O, Reinert K, Sturm M, Gröpl C, Lange E, Klein C, Huber CG: Absolute myoglobin quantitation in serum by combining twodimensional liquid chromatographyelectrospray ionization mass spectrometry and novel data analysis algorithms. J Proteome Res 2006, 5(2):414–421. 10.1021/pr050344uView ArticlePubMedGoogle Scholar
 Gerber SA, Rush J, Stemman O, Kirschner MW, Gygi SP: Absolute quantification of proteins and phosphoproteins from cell lysates by tandem MS. Proc Natl Acad Sci USA 2003, 100(12):6940–6945. 10.1073/pnas.0832254100PubMed CentralView ArticlePubMedGoogle Scholar
 Neubert H, Bonnert T, Rumpel K, Hunt B, Henle E, James I: LabelFree Detection of Differential Protein Expression by LC/MALDI Mass Spectrometry. J Proteome Res 2008.Google Scholar
 Mirgorodskaya E, Braeuer C, Fucini P, Lehrach H, Gobom J: Nanoflow liquid chromatography coupled to matrixassisted laser desorption/ionization mass spectrometry: sample preparation, data analysis, and application to the analysis of complex peptide mixtures. Proteomics 2005, 5(2):399–408. 10.1002/pmic.200400984View ArticlePubMedGoogle Scholar
 Ji C, Li L: Quantitative Proteome Analysis Using Differential Stable Isotopic Labeling and Microbore LCMALDI MS and MS/MS. Journal of Proteome Research 2005, 4(3):734–742. 10.1021/pr049784wView ArticlePubMedGoogle Scholar
 Lu P, Vogel C, Wang R, Yao X, Marcotte EM: Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nat Biotechnol 2007, 25: 117–124. 10.1038/nbt1270View ArticlePubMedGoogle Scholar
 Tang H, Arnold RJ, Alves P, Xun Z, Clemmer DE, Novotny MV, Reilly JP, Radivojac P: A computational approach toward labelfree protein quantification using predicted peptide detectability. Bioinformatics 2006, 22(14):e481e488. 10.1093/bioinformatics/btl237View ArticlePubMedGoogle Scholar
 Mallick P, Schirle M, Chen SS, Flory MR, Lee H, Martin D, Ranish J, Raught B, Schmitt R, Werner T, Kuster B, Aebersold R: Computational prediction of proteotypic peptides for quantitative proteomics. Nat Biotechnol 2007, 25: 125–131. 10.1038/nbt1275View ArticlePubMedGoogle Scholar
 Gay S, Binz PA, Hochstrasser DF, Appel RD: Peptide mass fingerprinting peak intensity prediction: extracting knowledge from spectra. Proteomics 2002, 2(10):1374–1391. 10.1002/16159861(200210)2:10<1374::AIDPROT1374>3.0.CO;2DView ArticlePubMedGoogle Scholar
 Hansmeier N, Chao TC, Pühler A, Tauch A, Kalinowski J: The cytosolic, cell surface and extracellular proteomes of the biotechnologically important soil bacterium Corynebacterium efficiens YS314 in comparison to those of Corynebacterium glutamicum ATCC 13032. Proteomics 2006, 6: 233–250. 10.1002/pmic.200500144View ArticlePubMedGoogle Scholar
 Pappin DJ, Hojrup P, Bleasby AJ: Rapid identification of proteins by peptidemass fingerprinting. Curr Biol 1993, 3(6):327–332. 10.1016/09609822(93)90195TView ArticlePubMedGoogle Scholar
 Savitzky A, Golay JEM: Smoothing and Differentiation of Data by Simplified Least Squares Procedures. Anal Chem 1964, 36: 1627–1639. 10.1021/ac60214a047View ArticleGoogle Scholar
 Radulovic D, Jelveh S, Ryu S, Hamilton TG, Foss E, Mao Y, Emili A: Informatics platform for global proteomic profiling and biomarker discovery using liquid chromatographytandem mass spectrometry. Mol Cell Proteomics 2004, 3(10):984–997. 10.1074/mcp.M400061MCP200View ArticlePubMedGoogle Scholar
 Listgarten J, Emili A: Statistical and computational methods for comparative proteomic profiling using liquid chromatographytandem mass spectrometry. Mol Cell Proteomics 2005, 4(4):419–434. 10.1074/mcp.R500005MCP200View ArticlePubMedGoogle Scholar
 Anderle M, Roy S, Lin H, Becker C, Joho K: Quantifying reproducibility for differential proteomics: noise analysis for protein liquid chromatographymass spectrometry of human serum. Bioinformatics 2004, 20(18):3575–3582. 10.1093/bioinformatics/bth446View ArticlePubMedGoogle Scholar
 Buhrman D, Price P, Rudewicz P: Quantitation of SR 27417 in Human Plasma Using Electrospray Liquid ChromatographyTandem Mass Spectrometry: A Study of Ion Suppression. J Amer Soc Mass Spectrom 1996, 7: 1099–1105. 10.1016/S10440305(96)000724View ArticleGoogle Scholar
 Schölkopf B, Bartlett P, Smola A, Williamson R: Shrinking the Tube: A New Support Vector Regression Algorithm. Advances in Neural Information Processing Systems 1999. [http://users.rsise.anu.edu.au/~williams/papers/P105.pdf]Google Scholar
 Burges CJ: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 1998, 2: 121–167. 10.1023/A:1009715923555View ArticleGoogle Scholar
 Ritter H: Learning with the SelfOrganizing Map. In Artificial Neural Networks. Edited by: TK et al. Amsterdam: Elsevier Science Publishers; 1991:379–384.Google Scholar
 Chambers JM, Hastie TJ, (Eds): Statistical Models in S, Linear models. Volume 4. Wadsworth & Brooks/Cole; 1992.Google Scholar
 Vapnik VN: The Nature of Statistical Learning Theory. 1st edition. Springer; 1995.View ArticleGoogle Scholar
 R Development Core Team:R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2006. [http://www.Rproject.org]Google Scholar
 Dimitriadou E, Hornik K, Leisch F, Meyer D, Weingessel A:The e1071 Package. Department of Statistics (e1071), TU Wien; 2006. Friedrich.Leisch@ci.tuwien.ac.at [Manual for R package e1071] [http://cran.cnr.berkeley.edu/]Google Scholar
 Kohonen T: SelfOrganized Formation of Topologically Correct Feature Maps. Biological Cybernetics 1982, 43: 59–69. 10.1007/BF00337288View ArticleGoogle Scholar
 Scherbart A, Timm W, Böcker S, Nattkemper TW: SOMbased Peptide Prototyping for Mass Spectrometry Peak Intensity Prediction. WSOM'07 2007. [http://biecoll.ub.unibielefeld.de/frontdoor.php?source_opus=150&la=en] 10.2390/biecollwsom2007157Google Scholar
 Cleveland WS, Devlin SJ: LocallyWeighted Regression: An Approach to Regression Analysis by Local Fitting. Journal of the American Statistical Association 1988, 83: 596–610. 10.2307/2289282View ArticleGoogle Scholar
 Millington PJ, Baker WL: Associative Reinforcement Learning for Optimal Control. Proc Conf on AIAA Guid Nav and Cont 1990, 2: 1120–1128. [http://dspace.mit.edu/handle/1721.1/13830?show=full]Google Scholar
 Hastie T, Loader C: Local regression: Automatic kernel carpentry. Statistical Science 1993. [http://www.jstor.org/pss/2246148]Google Scholar
 Kawashima S, Ogata H, Kanehisa M: AAindex: Amino Acid Index Database. Nucleic Acids Res 1999, 27: 368–369. 10.1093/nar/27.1.368PubMed CentralView ArticlePubMedGoogle Scholar
 Zhang Z: Prediction of lowenergy collisioninduced dissociation spectra of peptides. Anal Chem 2004, 76(14):3908–3922. 10.1021/ac049951bView ArticlePubMedGoogle Scholar
 Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer; 2001.View ArticleGoogle Scholar
 Vásquez M, Némethy G, Scheraga HA: Computed Conformational States of the 20 Naturally Occuring Amino Acid Residues and of the Prototype Residue α Aminobutyric Acid. Macromolecules 2001, 16: 1043–1049. 10.1021/ma00241a004View ArticleGoogle Scholar
 NaderiManesh H, Sadeghi M, Arab S, Movahedi AAM: Prediction of protein surface accessibility with information theory. Proteins 2001, 42(4):452–459. 10.1002/10970134(20010301)42:4<452::AIDPROT40>3.0.CO;2QView ArticlePubMedGoogle Scholar
 Wilce MCJ, Aguilar MI, Hearn MTW: Physicochemical Basis of Amino Acid Hydrophobicity Scales: Evaluation of Four New Scales of Amino Acid Hydrophobicity Coefficients Derived from RPHPLC of Peptides. Analytical chemistry 1995, 67(7):1210–1219. 10.1021/ac00103a012View ArticleGoogle Scholar
 Fauchére JL, Charton M, Kier LB, Verloop A, Pliska V: Amino acid side chain parameters for correlation studies in biology and pharmacology. Int J Pept Protein Res 1988, 32(4):269–278.View ArticlePubMedGoogle Scholar
 Breiman L: Random Forests. Machine Learning 2001, 45(1):5–32. 10.1023/A:1010933404324View ArticleGoogle Scholar
 Breiman L:Manual On Setting Up, Using, And Understanding Random Forests V3.1. 2002. [http://oz.berkeley.edu/users/breiman/Using_random_forests_V3.1.pdf]Google Scholar
 Khanarian G, Moore WJ: The Kerr Effect of Amino Acids in Water. Aust J Chem 1980, 33: 1727–1741.View ArticleGoogle Scholar
 Meyer F, Goesmann A, McHardy AC, Bartels D, Bekel T, Clausen J, Kalinowski J, Linke B, Rupp O, Giegerich R, PÜhler A: GenDB – an open source genome annotation system for prokaryote genomes. Nucleic Acids Res 2003, 31(8):2187–2195. 10.1093/nar/gkg312PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.