Isotope pattern deconvolution for peptide mass spectrometry by non-negative least squares/least absolute deviation template matching

Slawski, Martin; Hussong, Rene; Tholey, Andreas; Jakoby, Thomas; Gregorius, Barbara; Hildebrandt, Andreas; Hein, Matthias

doi:10.1186/1471-2105-13-291

Methodology article
Open access
Published: 08 November 2012


BMC Bioinformatics volume 13, Article number: 291 (2012)


Abstract

Background

The robust identification of isotope patterns originating from peptides being analyzed through mass spectrometry (MS) is often significantly hampered by noise artifacts and the interference of overlapping patterns arising e.g. from post-translational modifications. As the classification of the recorded data points into either ‘noise’ or ‘signal’ lies at the very root of essentially every proteomic application, the quality of the automated processing of mass spectra can significantly influence the way the data might be interpreted within a given biological context.

Results

We propose non-negative least squares/non-negative least absolute deviation regression to fit a raw spectrum by templates imitating isotope patterns. In a carefully designed validation scheme, we show that the method exhibits excellent performance in pattern picking. It is demonstrated that the method is able to disentangle complicated overlaps of patterns.

Conclusions

We find that regularization is not necessary to prevent overfitting and that thresholding is an effective and user-friendly way to perform feature selection. The proposed method avoids problems inherent in regularization-based approaches, comes with a set of well-interpretable parameters whose default configuration is shown to generalize well without the need for fine-tuning, and is applicable to spectra of different platforms. The R package IPPD implements the method and is available from the Bioconductor platform (http://bioconductor.fhcrc.org/help/bioc-views/devel/bioc/html/IPPD.html).

Background

Mass spectrometry (MS), often in conjunction with high performance liquid chromatography (HPLC), is the de facto standard analytical tool to derive important biological knowledge about the protein content of whole cells, organelles, or biomedical samples like tumour or blood plasma. Within a typical experimental setup, purified proteins of the sample under study are digested by an enzyme. Before entering the mass spectrometer, peptides are separated chromatographically according to their physico-chemical properties in order to avoid a massive overlapping of peptide signals within a single scan. Nevertheless, due to the sheer number of peptides present in a sample, interfering patterns still occur frequently, not least because of post-translational modifications such as the deamidation of asparagine or glutamine residues. In order to obtain an unambiguous assignment of the signals, and in particular their isotope patterns, which is a prerequisite for a proper identification and quantification, every data point in the m/z dimension is classified either as ‘signal’ or as ‘noise’ during the so-called feature detection phase. As this processing lies at the very root of every proteomic application, the quality of feature detection can have dramatic impact on the finally derived results and conclusions. In view of the large amount of data even a single MS experiment can produce, automated analysis is indispensable. However, due to various artifacts arising from electric and chemical noise and baseline trends, the identification of isotope patterns is error-prone and time consuming. In addition, severe overlaps of peptide signals within the same mass spectrometric scan can further hamper a straightforward analysis. In recent years, numerous procedures have been developed to process these data (cf., e.g., [1–8]).

Within this paper, we propose a novel method that is demonstrated to perform especially well in challenging situations, characterized e.g. by strong local variations in noise and intensity levels or the presence of isotope patterns of different charges exhibiting overlap, which in many cases may be difficult to resolve even for a human expert by visual inspection. Existing software typically depends on a large set of parameters requiring careful fine-tuning, often being rather sensitive to changes in the measurement process like the change of the platform, which makes a proper parameter choice a laborious task. In contrast, the proposed method has been designed to depend on a comparatively small set of well-interpretable parameters whose default configuration is shown to be robust, yielding mostly excellent, but at least competitive performance on spectra of different platforms.

In a nutshell, our method uses non-negative least squares or non-negative least absolute deviation regression to fit a spectrum s by a large dictionary of templates mimicking isotope patterns; since true positions and charges of isotope patterns in the spectrum are unknown in advance, regions where the signal exceeds a local measure of noise are identified and then a vast set of templates is placed in those regions. In the spirit of sparse recovery, a small subset of the templates, which reasonably explains the observed signal, is selected by applying hard thresholding with a locally adaptive choice of the threshold to the regression coefficients obtained previously. Our method is related to a formerly proposed template-based approach (NITPICK, [3]). As opposed to the present work, NITPICK uses ℓ₁-regularized non-negative least squares. Without non-negativity constraints, this procedure is known as the lasso [9]. Reference [10] contains the first application of the lasso to the problem studied in the present paper.
Given a dramatic increase in occurrence of high-dimensional datasets in recent years and the resulting need for feature selection, the lasso, due to computationally and theoretically appealing properties, has meanwhile become so popular that it can be regarded as a standard tool of modern data analysis [11]. In this respect, NITPICK follows the usual paradigm suggesting that ℓ₁-regularization is the method of choice. In the present paper, we argue for a deviation from that paradigm mainly in view of the following two aspects. First, a major benefit of our fitting+thresholding approach is that parameter choice is more user-friendly, since the threshold can be interpreted in terms of a signal-to-noise ratio. This is unlike the regularization parameter of the lasso, which can in general not be related directly to the signal. In the presence of heterogeneous noise and model misspecifications, the ‘right amount’ of regularization is notoriously difficult to choose. Second, there is a substantial body of work showing that non-negativity constraints alone may suffice to recover a sparse target. Non-negative least squares + thresholding is analyzed in [12], where it is shown that it can significantly outperform the usual ℓ₁-approach with respect to sparse recovery. See Section “Sparse recovery with non-negativity constraints: non-negative least squares + thresholding vs. the non-negative lasso” for a detailed discussion.

Methods

A spectrum is understood as a sequence of pairs $\{(x_{i}, y_{i})\}_{i = 1}^{n}$, where x_i=m_i/z_i is a mass-to-charge ratio (mass m_i, measured in Dalton (Da), divided by charge z_i), and y_i is the intensity, i.e. the abundance of a particular mass (modulo charge state), observed at x_i, i=1,…,n. The x_i are assumed to be ordered increasingly.

Template model

The ${(y_{i})}_{i = 1}^{n} = y$ are modeled as a positive combination of templates designed on the basis of prior knowledge about peak shape and composition of isotope patterns. If our model were perfectly correct, we could write

y = Φ β^{*} = \sum_{c = 1}^{C} Φ_{c} β_{c}^{*}, Φ_{c} = [φ_{c, 1} \dots φ_{c, p_{c}}], c = 1, \dots, C,

(1)

where Φ is a non-negative matrix of templates and β^∗ is a non-negative coefficient vector. Both Φ and β^∗ can be arranged according to charge states c=1,…,C. Each sub-matrix Φ_c can in turn be divided into columns $φ_{c, 1}, \dots, φ_{c, p_{c}}$, where the entries of each column vector store the evaluations of a template φ_c,j, j=1,…,p_c, at the x_i, i=1,…,n. It is assumed that only a small fraction of the templates in Φ are needed to represent the signal, i.e. β^∗ is highly sparse. The templates are of the form

φ_{c, j} = \sum_{k} a_{c, j, k} ψ_{c, j, k, θ_{c, j}},

(2)

where the ψ_c,j,k are functions representing a single peak within an isotope pattern, depending on a location m_c,j and a parameter vector θ_c,j, and the sum runs over the isotopic peaks k of the pattern. In general, peaks can be modeled by Gaussian, Lorentzian, and sech² shapes, cf. [13]. Due to their similarity, we restrict ourselves to the Gaussian, but provide in addition the exponentially modified Gaussian (EMG, cf., e.g., [14]), a model for a possibly skewed peak as occurring frequently in MALDI-TOF recordings, where late ion formation in the gas phase leads to tailed peaks [15]. The EMG is parameterized by $θ_{c, j} = {(α_{c, j}, σ_{c, j}, μ_{c, j})}^{⊤} \in R^{+} \times R^{+} \times R$ (for α_c,j ↓ 0, one obtains a Gaussian):

\begin{align} ψ_{c, j, k} (x) = \frac{1}{α_{c, j}} \exp (\frac{σ_{c, j}^{2}}{2 α_{c, j}^{2}} + \frac{μ_{c, j} - (x - m_{c, j, k})}{α_{c, j}}) \\ \times (1 - F (\frac{σ_{c, j}}{α_{c, j}} + \frac{μ_{c, j} - (x - m_{c, j, k})}{σ_{c, j}})), \\ F (t) = \int_{- \infty}^{t} \frac{1}{\sqrt{2 \pi}} \exp (- \frac{u^{2}}{2}) du. \end{align}

(3)
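
To make Eq. (3) concrete, the following is a minimal sketch of the EMG peak in plain Python. It is not taken from the IPPD package; the function name `emg_peak` and all numerical values are made up for illustration. The term 1 − F(t) is written with the complementary error function, using the identity 1 − F(t) = erfc(t/√2)/2 for the standard normal distribution function F.

```python
import math

def emg_peak(x, m, alpha, sigma, mu):
    """Exponentially modified Gaussian of Eq. (3) at x, for a peak located at m."""
    d = mu - (x - m)
    # 1 - F(t) for the standard normal CDF F, via the complementary error function
    upper_tail = 0.5 * math.erfc((sigma / alpha + d / sigma) / math.sqrt(2.0))
    return math.exp(sigma ** 2 / (2.0 * alpha ** 2) + d / alpha) * upper_tail / alpha

# Illustrative (made-up) parameters, not estimates from any real spectrum
m, alpha, sigma, mu = 1000.0, 0.05, 0.03, 0.0
step = 0.001
grid = [m - 1.0 + step * i for i in range(3001)]
values = [emg_peak(x, m, alpha, sigma, mu) for x in grid]

# Riemann sums: the peak integrates to about one and its mean is shifted to the right
area = step * sum(values)
mean = step * sum(x * v for x, v in zip(grid, values)) / area
```

For these values the numerical mean lies close to m + μ + α, which reflects the right-hand tail that distinguishes the EMG from a symmetric Gaussian. For very small α the exponential factor overflows in floating point, so a production implementation would need a more careful evaluation.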

In (2), the nonnegative weights a_c,j,k equal the height of the isotopic peak k within the pattern j of charge state c. These heights are computed according to the averagine model [16]. The m_c,j,k are calculated from m_c,j as $m_{c, j, k} = m_{c, j} + κ \frac{k}{c}$, where κ usually ranges between 1.002 and 1.008 Dalton, see e.g. [17]. Note that in Eq. (2) the location of the most intense peak (a_c,j,0=max_k a_c,j,k) is taken as characteristic location of the template instead of using the finally reported monoisotopic position: we set m_c,j,0=m_c,j so that the remaining m_c,j,k, k≠0, are computed by shifting m_c,j in both directions along the m/z axis. With the normalization max_x φ_c,j(x)=1 for all c,j, the entries of β^∗ can be interpreted as intensities of the most intense peaks of the templates. The construction scheme is illustrated in Figure 1.
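
The construction of a single template can be sketched as follows, assuming Gaussian peaks of a common width. The isotopic heights below are hypothetical placeholders and are not computed from the averagine model; the function names and the value κ = 1.003 are likewise illustrative choices. Normalization is carried out over the evaluation grid rather than over all x.

```python
import math

def gaussian_peak(x, center, sigma):
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)

def template(x_values, m, charge, heights, sigma, kappa=1.003):
    """Evaluate one template of Eq. (2) on x_values.

    heights maps the isotopic offset k (k = 0 is the most intense peak) to a
    relative height a_k; peak k is centred at m + kappa * k / charge.
    """
    raw = [
        sum(a * gaussian_peak(x, m + kappa * k / charge, sigma)
            for k, a in heights.items())
        for x in x_values
    ]
    top = max(raw)
    return [v / top for v in raw]  # normalise so that the maximum on the grid is 1

# Hypothetical heights, NOT derived from the averagine model
heights = {-1: 0.8, 0: 1.0, 1: 0.6, 2: 0.25}
x_values = [999.0 + 0.01 * i for i in range(301)]
phi = template(x_values, m=1000.0, charge=2, heights=heights, sigma=0.05)
```

For charge state 2 the isotopic peaks are spaced by κ/2, so the peak following the most intense one appears near m/z 1000.5 in this example.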

Parameter estimation

The parameters θ_c,j=(α_c,j, σ_c,j, μ_c,j)^⊤ of the peaks (3) are unknown in practice. Following a central paradigm of our framework, which is to relieve the user of performing laborious fine-tuning of parameters, we have developed a systematic procedure automatically providing estimates of these parameters, which is considerably more efficient and flexible than a grid search. For instance, the parameters may additionally depend on the m/z-position. Our framework for parameter estimation extends a conceptually similar approach in [18] designed for a Gaussian peak shape.

In a first step, we apply a simple peak detection algorithm to the spectrum to identify disjoint regions $R_{r} \subset \{1, \dots, n\}, r = 1, \dots, R$, of well-resolved peaks. For each region, we fit the chosen peak shape to the data $\{(x_{i}, y_{i})\}_{i \in R_{r}}$ using nonlinear least squares:

\min_{θ} \sum_{i \in R_{r}} {(y_{i} - ψ_{θ} (x_{i}))}^{2},

(4)

yielding an estimate ${\hat{θ}}_{r} ({\hat{x}}_{r})$, where ${\hat{x}}_{r}$ denotes an estimate of the mode of the peak in region $R_{r}$. This concept is sketched in Figure 2. The nonlinear least squares problem (4) is solved by using a general purpose nonlinear least squares routine available in most scientific computing environments, e.g. nls in R. Once the sequence of estimators $\{{\hat{θ}}_{r} ({\hat{x}}_{r})\}$ has been obtained, they are subject to a suitable aggregation procedure. In the simplest case, one could simply take averages. For spectra where peak shape characteristics, in particular peak width, are known to vary systematically with m/z position, we use the pairs $\{({\hat{x}}_{r}, {\hat{θ}}_{r} ({\hat{x}}_{r}))\}$ as input into a linear regression procedure to infer the parameters of pre-specified trend functions. Formally, we model each component θ_l of θ as a linear combination of known functions g_l,m of x=m/z and an error component ε_l, i.e.

θ_{l} (x) = \sum_{m = 1}^{M_{l}} ν_{l, m} g_{l, m} (x) + ε_{l} (x),

(5)

for which a linear trend, i.e. θ_l(x)=ν_l,1 + ν_l,2 x, is one of the most common special cases. In [19], a set of instrument-specific models for the peak width is provided, all of which can be fitted by our approach.

We refrain from using least squares regression to determine the parameters in (5) because of its sensitivity to outliers. Such outliers arise from poorly resolved, wiggly or overlapping isotope patterns, which may affect the quality of the estimates ${\hat{θ}}_{r}$. Therefore, the linear model is fitted in a robust way by using least absolute deviation regression. Given the resulting estimates of the parameters {ν_l,m}, m/z-specific estimates for the parameters in (3) are obtained by evaluating (5).
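
The robustness argument can be illustrated for the linear trend θ_l(x)=ν_l,1 + ν_l,2 x with a small sketch. It relies on the known fact that some optimal least absolute deviation line passes through two of the data points, so an exhaustive search over point pairs is exact. This brute-force search is chosen here for brevity and is only practical for a modest number of regions; it is not claimed to be the solver used in IPPD, and the function name and data are invented.

```python
from itertools import combinations

def lad_line(xs, ys):
    """Least absolute deviation fit of y = nu1 + nu2 * x.

    Exhaustive search over all lines through two data points (cubic cost).
    """
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical line cannot be written as y = nu1 + nu2 * x
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        cost = sum(abs(y - intercept - slope * x) for x, y in zip(xs, ys))
        if best is None or cost < best[0]:
            best = (cost, intercept, slope)
    return best[1], best[2]

# Made-up peak-width estimates: sigma grows linearly with m/z, one region is an outlier
mz = [800.0, 1000.0, 1200.0, 1400.0, 1600.0, 1800.0]
sigma_hat = [0.02 + 1e-5 * x for x in mz]
sigma_hat[3] = 0.2  # outlier, e.g. caused by an overlapping pattern
nu1, nu2 = lad_line(mz, sigma_hat)
```

In this example the fitted line coincides with the trend followed by the five regular points, whereas an ordinary least squares fit would be pulled towards the outlier.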

Template fitting

The computation of the design matrix Φ requires a set of m/z positions at which templates are placed. In general, one has to choose positions from the interval [x₁,x_n]. We instead restrict ourselves to a suitable subset of the set $\{x_{i}\}_{i = 1}^{n}$. The deviation from the positions of the true underlying isotope patterns is then at least of the order of the sampling rate, but this can be improved by means of a postprocessing step described in Section “Postprocessing and thresholding”. Using the whole set $\{x_{i}\}_{i = 1}^{n}$ may be computationally infeasible if n is large and is in fact not necessary since isotope patterns occur very sparsely in the spectrum. Therefore, we apply a pre-selection step on the basis of what we term ‘local noise level’ (LNL). The LNL is defined as the median of the intensities y_i falling into a sliding window of fixed width around a specific position. For x∈[x₁,x_n], we define the local noise level based on sliding window width h as

\begin{align} LNL (x) = median (\{y_{i} : i \in I_{x}\}), \\ I_{x} = \{i : x_{i} \in [x - h, x + h]\} . \end{align}

(6)
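
A direct transcription of Eq. (6) might look as follows. The sketch assumes that the x_i are stored in a sorted list, so that the index set I_x can be found by bisection; the function name and the toy data are illustrative.

```python
from bisect import bisect_left, bisect_right
from statistics import median

def local_noise_level(x, xs, ys, h):
    """Median intensity among samples with m/z in [x - h, x + h]; xs must be sorted."""
    lo = bisect_left(xs, x - h)    # first index with xs[i] >= x - h
    hi = bisect_right(xs, x + h)   # one past the last index with xs[i] <= x + h
    return median(ys[lo:hi])

# Toy spectrum with one large intensity at x = 4
xs = [float(i) for i in range(11)]
ys = [1.0, 2.0, 1.0, 3.0, 50.0, 2.0, 1.0, 2.0, 3.0, 1.0, 2.0]
lnl = [local_noise_level(x, xs, ys, h=2.0) for x in xs]
```

At x = 4 the window contains the intensities 1, 3, 50, 2, 1, whose median is 2: the isolated large value does not inflate the noise estimate, which is the robustness property of the median.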

Given the LNL, we place templates at position x_i (one for each charge state) if the corresponding y_i exceeds LNL(x_i) by a factor factor.place. Section “Finding a set of default parameters” describes how we determined defaults for the two parameters h and factor.place. In fact, the LNL is a central quantity in our framework, because it influences not only the placement but also the selection of templates (see Section “Postprocessing and thresholding” below). Choosing h too small typically has the effect that the LNL is overestimated such that true peaks might be incorrectly classified as noise. Conversely, choosing h too large leads to an underestimation, thereby increasing the computational burden as well as the number of spurious patterns included in the final list. The advantages of working with the median are obvious: easy computation, robustness and equivariance with respect to monotone transformations. Similar notions of local noise can be found in the literature, see e.g. [8] where a truncated mean is used. Given the positions of the templates, we generate the matrix Φ according to Eqs. (1) and (2). In the fitting step, we compute a non-negative least squares (q=2) or alternatively non-negative least absolute deviation (q=1) fit by determining a minimizer $\hat{β}$ of the criterion

\begin{align} \min_{β \geq 0} {∥y - Φ β∥}_{q}^{q}, \quad q = 1 \text{ or } q = 2. \end{align}

(7)

The optimization problem (7) is a quadratic (q=2) or linear (q=1) program and is solved using interior point methods (e.g. [20]). The details are relegated to the Appendix section “Fitting with non-negativity constraints”. As far as the choice of q is concerned, we point out that q=1 yields a robust fit that can deal better with deviations from model assumptions, i.e. deviations from the averagine model or from the peak model. However, in general, we are unable to provide any recommendation about how to choose q. Therefore, in our validation, both are evaluated.
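
For readers who want to experiment with the case q=2, the following sketch solves a tiny instance of (7). It deliberately substitutes cyclic coordinate descent for the interior point method mentioned above, because coordinate descent fits in a few lines of plain Python; it is adequate for small, well-conditioned toy problems only. Templates are passed as a list of columns, and all names and data are invented for illustration.

```python
import math

def nnls_coordinate_descent(columns, y, sweeps=500):
    """Minimise ||y - sum_j beta_j * columns[j]||_2^2 subject to beta >= 0.

    Each coordinate update is the exact one-dimensional minimiser, clipped at zero.
    """
    beta = [0.0] * len(columns)
    residual = list(y)
    norms = [sum(v * v for v in col) for col in columns]
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            if norms[j] == 0.0:
                continue
            grad = sum(c * r for c, r in zip(col, residual))
            new = max(0.0, beta[j] + grad / norms[j])
            delta = new - beta[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                beta[j] = new
    return beta

xs = [0.1 * i for i in range(101)]

def bump(center):
    # Single Gaussian peak standing in for a full isotope template
    return [math.exp(-0.5 * ((x - center) / 0.5) ** 2) for x in xs]

columns = [bump(4.0), bump(5.0), bump(8.0)]
# Noise-free signal built from the first two templates only
y = [2.0 * a + 0.5 * b for a, b in zip(columns[0], columns[1])]
beta_hat = nnls_coordinate_descent(columns, y)
```

In this noise-free example the two overlapping templates receive coefficients close to 2 and 0.5, and the unused third template a coefficient close to 0.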

Comparison with pepex

In prior work [21], subsequently referred to as ‘pepex’, non-negative least squares fitting is used as well. An important difference from our approach is that the matrix Φ is not constructed from the convolution of isotope distributions and peak shapes as described in Section “Template model”. Instead, peak detection is applied first to reduce the raw intensity data to peak clusters, a step that is usually referred to as centroiding. At the second stage, called de-isotoping, peak clusters are fitted by a design matrix containing isotope distributions themselves, not convolved versions. While the approach is computationally more attractive and avoids estimation of peak shape parameters (cf. Section “Parameter estimation”), the division into centroiding and de-isotoping may lead to poor performance for low resolution and noisy data, or in the presence of overlapping patterns. In these cases, peak detection is unreliable. In our template-based approach, there is no separation of centroiding and de-isotoping. It performs much better in the aforementioned cases, since it operates directly on the data and is hence less affected if single peaks of a pattern are difficult to detect. This reasoning is supported by our evaluation in Section “Results and discussion” as well as that in [3]. At the same time, our approach can in principle be applied to centroided spectra as well. In this case, the columns of the matrix Φ directly represent isotope distributions instead of isotopic patterns.

Postprocessing and thresholding

While indeed a considerable fraction of the entries of $\hat{β}$ are precisely equal to zero, treating all positions for which the corresponding entry differs from zero as locations of isotope patterns would yield a huge number of false positives, not least because of regions in which fitting noise reduces the objective in (7). Therefore, the fitting step of the previous section is accompanied by a thresholding step, with the aim of separating signal from noise. However, fitting followed by thresholding alone does not lead to a proper output. The strategy would be successful only if our template model were free of any kind of misspecification. Even when neglecting possible misfits of the averagine model, we still have to cope with two sources of systematic errors: a limited sampling rate and mismatches in the peak model. These are the main reasons for what we term ‘peak splitting’, referring to the phenomenon that several templates are used to fit precisely one pattern. Figure 3 illustrates the effect of sampling in a noiseless setting. In the top panel, the signal is sampled in such a way that the top of the peak is lost. When placing two templates at the two sampling points x_l, x_u of maximum signal, non-negative least squares fitting attributes weights ${\hat{β}}_{l}, {\hat{β}}_{u}$ of roughly equal size to the templates. The postprocessing procedure outlined below yields a suitable correction. One might object that ‘peak splitting’ is a problem inherent in our entirely fitting-oriented approach (7), which does not incorporate any form of regularization. The bottom panel of Figure 3 shows the solution path of the non-negative lasso [22] given by $\{\hat{β} (λ), λ \geq 0\}, \hat{β} (λ) = {argmin}_{β \geq 0} {∥y - Φ β∥}_{2}^{2} + λ 1^{⊤} β$. One obtains two nearly parallel trajectories, demonstrating that only a heavily biased fit, which would undesirably lead to the exclusion of additional smaller signals, could accomplish the selection of only one template.

To a large extent, ‘peak splitting’ can be corrected by means of the following merging procedure, which we regard as postprocessing of the fitting step (7) and which we apply prior to thresholding. Given an estimate $\hat{β}$, we define ${\hat{ℳ}}_{c} = \{m_{c, j} : {\hat{β}}_{c, j} > 0\} \subset \{x_{i}\}_{i = 1}^{n}, c = 1, \dots, C,$ as the set of all template locations where the corresponding coefficient exceeds 0.

1. Separately for each c, divide the sets ${\hat{ℳ}}_{c}$ into groups $G_{c, 1}, \dots, G_{c, G_{c}}$ of ‘adjacent’ positions. Positions are said to be adjacent if their distance on the m/z scale is below a certain tolerance ppm specified in parts-per-million, cf. Section “Finding a set of default parameters”. In the context of ‘peak splitting’, the templates at locations sharing the same group are assumed to fit precisely one true underlying peak.

2. With the notation of Eq. (2), for each c=1,…,C, and for g=1,…,G_c, we solve the following optimization problem:

$({\tilde{m}}_{c, g}, {\tilde{β}}_{c, g}) = \underset{m_{c, g},\, β_{c, g}}{argmin} \int_{- \infty}^{\infty} {(\sum_{m_{c, j} \in G_{c, g}} {\hat{β}}_{c, j} ψ_{m_{c, j}} (x) - β_{c, g} ψ_{m_{c, g}} (x))}^{2} dx,$

(8)

with the aim of finding a location ${\tilde{m}}_{c, g}$ and a weight ${\tilde{β}}_{c, g}$ of the most intense peak $ψ_{{\tilde{m}}_{c, g}}$ within an isotope pattern φ_c,g that best approximates, in a least squares sense, the fit of the most intense peaks $\{ψ_{m_{c, j}} : m_{c, j} \in G_{c, g}\}$ within the isotope patterns $\{φ_{c, j} : m_{c, j} \in G_{c, g}\}$.

3. One ends up with sets ${\tilde{ℳ}}_{c} = \{{\tilde{m}}_{c, g}\}_{g = 1}^{G_{c}}$ and coefficients $\{{\tilde{β}}_{c, g}\}_{g = 1}^{G_{c}}, c = 1, \dots, C$.

The additional benefit of solving (8) in step two, as compared to selecting the template with the largest coefficient within each group as proposed in [3], is that we are able to determine the location of the pattern more accurately than the limited sampling rate would otherwise permit, since in (8) we optimize the location over a continuum. The optimization problem (8) can be solved fast and accurately by sampling the integrand on a fine grid of points and then solving a nonlinear least squares problem with optimization variables m_c,g and β_c,g.
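
The merging procedure can be sketched under simplifying assumptions: Gaussian peaks of a common width, grouping by chaining neighbouring positions (with the tolerance measured relative to the left neighbour), and a golden-section search over the location that presumes the objective is unimodal between the outermost positions of a group. For a fixed location, the optimal weight in (8) is available in closed form, which reduces the problem to one dimension. Function names and numbers are illustrative and do not reproduce the IPPD implementation.

```python
import math

def group_adjacent(positions, ppm):
    """Chain sorted m/z positions whose neighbours differ by less than ppm parts per million."""
    groups = []
    for m in sorted(positions):
        if groups and (m - groups[-1][-1]) / groups[-1][-1] * 1e6 < ppm:
            groups[-1].append(m)
        else:
            groups.append([m])
    return groups

def merge_group(locations, weights, sigma, iterations=60):
    """Approximate Eq. (8): one Gaussian peak replacing the weighted peaks of a group."""
    lo, hi = min(locations), max(locations)
    step = sigma / 20.0  # fine grid on which the integrand is sampled
    n = int((hi - lo + 10.0 * sigma) / step) + 1
    grid = [lo - 5.0 * sigma + step * i for i in range(n)]

    def psi(x, m):
        return math.exp(-0.5 * ((x - m) / sigma) ** 2)

    fitted = [sum(b * psi(x, m) for m, b in zip(locations, weights)) for x in grid]

    def weight_and_loss(m):
        peak = [psi(x, m) for x in grid]
        # Closed-form least squares weight for a fixed location m
        beta = sum(f * p for f, p in zip(fitted, peak)) / sum(p * p for p in peak)
        loss = sum((f - beta * p) ** 2 for f, p in zip(fitted, peak))
        return beta, loss

    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iterations):  # golden-section search over the location
        c, d = b - ratio * (b - a), a + ratio * (b - a)
        if weight_and_loss(c)[1] < weight_and_loss(d)[1]:
            b = d
        else:
            a = c
    m_tilde = 0.5 * (a + b)
    return m_tilde, weight_and_loss(m_tilde)[0]

groups = group_adjacent([1000.00, 1000.02, 1000.50], ppm=50.0)
m_tilde, beta_tilde = merge_group([1000.00, 1000.02], [5.0, 5.0], sigma=0.05)
```

With two equally weighted templates at neighbouring sampling points, the merged location falls midway between them and the merged weight is close to the sum of the two coefficients, which mirrors the situation in the top panel of Figure 3.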

All candidate positions $({\tilde{m}}_{c, g}, {\tilde{β}}_{c, g})$ are assigned a signal-to-noise ratio

{ratio}_{c, g} = {GOF}_{+} ({\tilde{m}}_{c, g}) \cdot \frac{{\tilde{β}}_{c, g}}{{LNL}_{+} ({\tilde{m}}_{c, g})},

(9)

where ${LNL}_{+} ({\tilde{m}}_{c, g}) = max \{LNL ({\tilde{m}}_{c, g}), \frac{1}{4} median (\{LNL (x_{i})\}_{i = 1}^{n})\}$ is a truncated version of the local noise level, with a lower bound included to prevent the denominator in (9) from taking on tiny values in low-intensity regions. The factor ${GOF}_{+} ({\tilde{m}}_{c, g})$ represents a goodness-of-fit adjustment, a correction which aims at downweighting spurious peaks in low-intensity noise regions. These are not hard to distinguish from signal regions, which, in view of the presence of peak patterns, tend to be considerably regular. In order to spot noise regions, we fit the spectrum by single peaks (3) placed at each datum x_i, i=1,…,n, where the peak shape model, the associated peak shape parameters and the parameter q are chosen according to the choice made for template generation (Section “Template model”) and template fitting (Section “Template fitting”), respectively. Denote the residuals of the resulting fit by $\{r_{i}\}_{i = 1}^{n}$. A local measure of goodness-of-fit is defined by

{GOF}_{+} (x) = max \{1 - \frac{\sum_{i \in I_{x}} | r_{i} |^{q}}{\sum_{i \in I_{x}} | y_{i} |^{q}}, 0.5\}.

(10)

The idea underlying this procedure is that in noise regions, the fit to the data will be poor, and consequently, the size of the residuals is expected to be large relative to the signal, hence leading to a low goodness-of-fit statistic. The truncation at 0.5 limits the influence of this correction. A final list is generated by checking whether the signal-to-noise ratios (9) exceed a ‘significance threshold’ t specified by the user. We do not give a general guideline for choosing t, because a reasonable choice is very specific to experimental conditions, e.g. the platform used and the composition of the spectrum. It is important to note that while t itself is constant, we take into account that the noise level is heterogeneous, since thresholding is based on the ratios (9), where the local noise level enters.
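
The thresholding step based on the ratios (9) can be sketched as follows. The goodness-of-fit factor is treated as an input that has already been computed according to Eq. (10); the candidate tuples, the noise levels and the threshold are invented for illustration.

```python
from statistics import median

def truncated_lnl(lnl_at_candidate, lnl_all):
    """LNL_+: local noise level bounded below by a quarter of the median LNL of the spectrum."""
    return max(lnl_at_candidate, 0.25 * median(lnl_all))

def signal_to_noise(beta_tilde, lnl_at_candidate, lnl_all, gof_plus):
    """Ratio of Eq. (9); gof_plus is the goodness-of-fit factor, computed elsewhere."""
    return gof_plus * beta_tilde / truncated_lnl(lnl_at_candidate, lnl_all)

def select(candidates, lnl_all, t):
    """Keep candidates (location, weight, LNL, GOF) whose ratio exceeds the threshold t."""
    return [c for c in candidates
            if signal_to_noise(c[1], c[2], lnl_all, c[3]) > t]

lnl_all = [4.0, 4.0, 8.0, 40.0, 4.0]   # made-up local noise levels, median 4
candidates = [
    (1000.0, 120.0, 4.0, 0.5),   # ratio 0.5 * 120 / 4  = 15
    (1500.0, 120.0, 40.0, 0.5),  # ratio 0.5 * 120 / 40 = 1.5
    (1800.0, 10.0, 0.1, 0.5),    # LNL truncated to 1, ratio 0.5 * 10 / 1 = 5
]
kept = select(candidates, lnl_all, t=4.0)
```

The first two candidates have the same weight but different local noise levels, so only the first passes the constant threshold; this is how a single value of t adapts to heterogeneous noise. The third candidate shows the effect of the lower bound in LNL_+, without which its ratio would be 50.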

Finding a set of default parameters

Apart from the signal-to-noise threshold t, we have introduced the parameters window, i.e. the width h of the sliding window required for the computation of the local noise level (6), the template placement parameter factor.place and the parts-per-million tolerance ppm within which peaks are merged by the postprocessing procedure. With the exception of the threshold t, we have fixed all parameters to a default setting which we expect to give reasonable (albeit potentially suboptimal) results on spectra different from the ones analyzed here, without the need for manual tuning. In order to find such a default setting, we performed a grid search using only one selected spectrum of those described in Section “Datasets” below. While our default setting, which can be found in the HTML manual of the R package IPPD, already performs well, we recommend performing such a calibration to optimize the performance of our method.

Sparse recovery with non-negativity constraints: non-negative least squares + thresholding vs. the non-negative lasso

We believe that our preference for the first alternative, non-negative least squares + thresholding, is a major methodological contribution that has potential to impact related problems where non-negativity constraints come into play. In the present section, we provide, at a high level, a series of arguments rooted in the statistics and signal processing literature that clarify our contribution and support our preference.

Linear models and usual paradigms in statistics

The fact that we favour non-negative least squares + thresholding may seem implausible since it questions or partially even contradicts paradigms about high-dimensional statistical inference. Consider the linear model

y \approx Φ β^{*}, y \in R^{n}, Φ \in R_{+}^{n \times p},

(11)

which corresponds to model (1), where ‘≈’ is used instead of ‘=’ to account for stochastic noise or model misspecifications. Linear models of the form (11) have been and continue to be objects of central interest in statistical modelling.

• Classical work in statistics shows that, under mild conditions, if the number of samples n grows at a faster rate than the number of features p, the ordinary least squares estimator ${\hat{β}}^{ols} \to β^{*}$ (in probability) as n→∞.
• Since many contemporary datasets, like the MS datasets of the present paper, are characterized by a large p, which is of the same order as n or even larger, the first bullet has considerably lost relevance. Translated to MS datasets, it provides a statement about the case where the resolution tends to infinity. Therefore, modern statistical theory studies regimes in which p is allowed to grow at a faster rate than n, with a focus on results that hold for finite sample sizes. These results hinge on some sort of sparsity assumption on β^∗, the simplest being that β^∗ is zero except for some index set (support) of small cardinality. In this context, a multitude of results has been proved (see e.g. [23] for an overview) indicating that the lasso estimate ${\hat{β}}^{lasso}$ is a statistically optimal procedure in the sense that if the regularization parameter is chosen in the right way, the squared Euclidean distance $∥ {\hat{β}}^{lasso} - β^{*} ∥_{2}^{2}$ is nearly of the same order as that of an estimator one could construct if the non-zeroes of β^∗ were known.

The second bullet provides quite some justification for NITPICK, which is based on the lasso. However, as detailed below, the proviso that the regularization parameter be chosen in the right way can be critical. On the other hand, there are several results that support our approach.

The power of non-negativity constraints

It turns out that the non-negativity constraint β≥0 imposed in non-negative least squares (NNLS) may lead to a drastically better performance than that of the ordinary least squares estimator in ‘large p’ situations provided Φ satisfies additional conditions. Roughly speaking, it is shown in [12] that if Φ has non-negative entries, which is fulfilled for the template matching problem of Section “Template model”, the NNLS estimator $\hat{β}$ does not overfit and is unique even in the singular case (p>n). These results indicate that NNLS may behave surprisingly well in a high-dimensional setup, without using ℓ₁-regularization, which is often promoted in the literature as basically the only option ([24], Section 16.2.2).
There are several recent papers [25–27] in the sparse recovery literature in which it is shown that a sparse, non-negative vector can be recovered from few linear measurements n≪p. In [12], these results are extended to a noisy setup. More specifically, it is shown that NNLS + thresholding can consistently recover β^∗ and its support. Very recently, using similar conditions as in [12], Meinshausen [28] has established several guarantees of NNLS in a high-dimensional setup.

One should bear in mind that the non-negativity constraints are essential for our approach. Thresholding the unconstrained ordinary least squares estimator ${\hat{β}}^{ols}$ in general leads to poor results in the ‘large p’ situation.

Shortcomings of ℓ₁-regularization in theory

In [12], it is not only shown that NNLS + thresholding is a sound strategy to perform sparse recovery of a non-negative target, but also examples are given where the non-negative lasso is outperformed even if its regularization parameter is set to match theoretical results and regardless of whether subsequent thresholding as advocated in [29, 30] is used or not. In particular, inferiority of the lasso arises in the presence of small, yet significantly non-zero entries in β^∗. These are specifically affected by the non-negligible bias of ℓ₁-regularization [31]. It is important to note that the comparison in [12] does not contradict prior comparisons of the lasso (aka soft thresholding) and (hard) thresholding for orthonormal designs (Φ^⊤Φ=I) in [32, 33], where both approaches perform similarly well and non-negativity constraints are not particularly important. Orthonormal designs, which lead to a greatly simplified estimation problem, are not of interest in the context of the present paper, since the template matrix Φ is far from being orthonormal.
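
The mechanics of the ℓ₁ bias are easiest to see in the orthonormal case, where both estimators act coordinate-wise on z = Φ^⊤y. Minimizing ∥y − Φβ∥₂² + λ1^⊤β over β≥0 then gives max(z_j − λ/2, 0), i.e. every retained coefficient is shifted down by τ = λ/2, whereas hard thresholding leaves retained coefficients untouched. The sketch below uses made-up numbers and serves only to illustrate this shrinkage; it does not reproduce the non-orthonormal examples of [12].

```python
def soft_threshold(z, tau):
    """Non-negative lasso solution for an orthonormal design: shift by tau, clip at zero."""
    return [max(v - tau, 0.0) for v in z]

def hard_threshold(z, t):
    """Non-negative least squares followed by thresholding: keep entries above t unchanged."""
    return [v if v > t else 0.0 for v in z]

# Made-up coefficients: a large signal, a small signal and a noise-level entry
z = [10.0, 1.0, 0.2]
soft = soft_threshold(z, 0.5)   # [9.5, 0.5, 0.0]
hard = hard_threshold(z, 0.5)   # [10.0, 1.0, 0.0]
```

Both rules remove the noise-level entry, but the soft-thresholded estimate halves the small signal while changing the large one by only 5 percent, which indicates why small, non-zero entries are the ones most affected by the bias.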

Shortcomings of ℓ₁-regularization in practice

The study in [12] is of a more theoretical nature, since all constants of the problem, in particular the noise level, are known, so that the regularization parameter can be set in an optimal fashion. This cannot realistically be accomplished in practice. Likewise, the information-theoretic criterion employed in [3] as well as the data-splitting approach of [34] rely on knowledge of the noise level, or a consistent estimate thereof, which is hard to obtain in the ‘large p’ situation [35]. In any case, the regularization parameter remains a quantity that is hard to grasp and hence hard to set for a practitioner, since it cannot be related directly to the signal. In contrast, the threshold t admits a straightforward interpretation as a signal-to-noise ratio.

Moreover, when using ℓ₁-regularization, data fitting and model selection are coupled. While this is often regarded as an advantage, since model selection is performed automatically, we think that it is preferable to have a clear separation between data fitting and model selection, which is a feature of our approach. Prior to thresholding, the output of our fitting approach gives rise to a ranking which we obtain without the necessity to specify any parameter. Selection is completely based on a single fit simply by letting the threshold vary. In contrast, if one wants to reduce the number of features selected by the lasso, one resets the regularization parameter and solves a new optimization problem. Note that it is in general not possible to compute the entire solution path of the lasso [22] for the MS datasets used for the present paper, where the dimension of Φ is in the tens of thousands so that the active set algorithm of [22] is prohibitively slow. In this regard, model selection by thresholding is computationally more attractive.

Results and discussion

For the assessment of the pattern picking performance, in total eight spectra generated by two different ionization methods, matrix assisted laser desorption/ionization (MALDI) and electrospray ionization (ESI), respectively, form the basis of the evaluation. While MALDI has been coupled to a time-of-flight (TOF) mass analyzer, ESI MS spectra have been recorded on both a linear ion trap (LTQ) and an Orbitrap mass analyzer. In addition, a series of spectra were prepared with the aim of investigating in detail the method’s performance in the presence of overlapping peptides.

Datasets

For MALDI mass spectra (Additional file 1), time of flight mass analysis was performed; spectra were recorded on an ABI MALDI-TOF/TOF 4800 instrument in positive ion mode using α-cyano-4-hydroxy-cinnamic acid (CHCA) as matrix. Nanospray ESI spectra (Additional file 2) were measured in positive ion mode on a Thermo LTQ Orbitrap Velos MS; both high resolution measurements using the Orbitrap mass analyzer (referred to as ‘Orbitrap’) and, alternatively, low resolution linear ion trap (IT) measurements were performed with this setup. This experimental setup has been chosen in order to demonstrate that our method is useful at different concentration levels, that it is robust with respect to changes in the data-generating process, and that it is capable of handling singly charged ions, the main form generated by MALDI MS, as well as higher charged ions formed in ESI MS. Tryptic digests (performed in 40 mM ammonium bicarbonate) of model proteins were used as analytes: bovine myoglobin and chicken egg lysozyme (10 and 500 fmol each) for MALDI-TOF experiments, and lysozyme (250 and 1000 fmol) for ESI experiments. Disulfide bonds were reduced with dithiothreitol (DTT) prior to alkylation; free cysteine residues were alkylated with iodoacetamide. No further sample pretreatment was performed prior to MS analysis. When referring to these spectra, we omit that tryptic digests are given: e.g., the term ‘MALDI-TOF myoglobin spectrum (500 fmol)’ means the respective tryptic digest.

To demonstrate explicitly the method’s ability to separate strongly overlapping patterns even in the case of badly resolved signals, 22 additional spectra have been generated in positive ion mode on a Bruker Daltonics HCT Ultra Ion Trap MS with an electrospray ion source. Three synthetic peptides (cf. Section “Unmixing of overlaps” for details) with sequences corresponding to tryptic peptides from bovine serum albumin (BSA) were used as analytes. In each measurement two out of three peptides were mixed in different ratios to get overlapping peptide signals, also with different charge states. Two different concentrations (500 fmol/μl and 1000 fmol/μl) were injected into the mass spectrometer via a Cole-Parmer syringe pump.

Validation strategy

Validation of pattern picking is notoriously difficult, because a gold standard which is satisfactory from both statistical and biological points of view is missing. In this context, a major problem one has to account for is that spectra frequently contain patterns whose shape is not distinguishable from those of peptides, but which are in fact various artifacts resulting e.g. from impurities during sample preparation and measurement. These artifacts do not constitute biologically relevant information and are, in this sense, ‘false positives’. An important instance are signals derived from the matrix (or from matrix-clusters) frequently observed in MALDI MS. The pattern of these signals is similar to that of peptides; nevertheless, due to their molecular composition, which differs significantly from that of an average peptide, the exact masses can be used to exclude these signals from the data analysis. On the other hand, from a statistical perspective which judges a method according to how well it is able to detect specific patterns in a given dataset, a qualification as ‘true positive’ is justified. With the aim to unify these aspects, we have worked out a dual validation scheme. In order to reduce the number of artifacts, all automatically generated lists of candidates for peptide masses as well as the lists of a human expert (see below) are postprocessed by a peptide mass filter [36]: only peptides whose monoisotopic mass deviated less than 200 ppm from the closest peptide mass center^a are used for subsequent evaluation.
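The peptide mass filter can be sketched as follows, using the mass center model stated in endnote a (1.000485·m_n + 0.029). This is an illustration only: it assumes that the nominal mass m_n ranges over the integers and that the ppm deviation is measured relative to the closest mass center; the function names are hypothetical and this is not the implementation of [36].

```python
def nearest_mass_center(mass):
    """Closest center 1.000485 * m_n + 0.029 over integer nominal masses m_n."""
    guess = round((mass - 0.029) / 1.000485)
    # Check the neighbouring nominal masses as well to be safe near midpoints.
    centers = [1.000485 * n + 0.029 for n in (guess - 1, guess, guess + 1)]
    return min(centers, key=lambda c: abs(mass - c))


def ppm_deviation(mass):
    """Deviation from the closest mass center in ppm (relative to the center)."""
    center = nearest_mass_center(mass)
    return abs(mass - center) / center * 1e6


def peptide_mass_filter(masses, max_ppm=200.0):
    """Keep only masses deviating less than max_ppm from a mass center."""
    return [m for m in masses if ppm_deviation(m) < max_ppm]


# Hypothetical candidate masses: the last one lies about 400 ppm off a center.
print(peptide_mass_filter([1000.514, 1000.614, 1000.914]))
```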

Comparison with manual annotation

The first part investigates how well a method is able to support a human expert who annotates the spectra manually. More specifically, the automatically generated lists are matched to the manual annotation such that an entry of the list (potential peptide mass) is declared ‘true positive’ whenever there is a corresponding mass in the manual annotation deviating by no more than Δ ppm. Otherwise, it is declared ‘false positive’. In order to adapt Δ to the resolution of the different mass lines, we used the following strategy: assuming that most of the peptides will have a mass larger than 700 Da, we determined the spacing Δ_m/z between neighboring data points in m/z direction for each mass spectrum in the lower mass range. If we further assume that a simple manual annotation by visual inspection can result in a mass deviation from the ‘correct’ mass position of at most Δ_m/z, we can derive the following tolerance values: Δ=100 ppm for ion trap recordings, Δ=50 ppm in the case of MALDI-TOF recordings^b and Δ=20 ppm for Orbitrap data.
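The text does not state the conversion from point spacing to a relative tolerance explicitly. One plausible reading, given here as an assumption, expresses the spacing Δ_m/z relative to the mass m at which it is measured:

\begin{align} Δ_{\text{ppm}} = 10^{6} \cdot \frac{Δ_{m/z}}{m} . \end{align}

For instance, a purely hypothetical spacing of Δ_m/z = 0.07 at m = 700 would correspond to Δ = 100 ppm; the spacings actually measured on the individual instruments are not restated here.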

As the performance of our method as well as that of all competing methods depends on a threshold-like parameter governing, crudely speaking, the trade-off between precision and recall, we explore the performance for a range of reasonable parameter values instead of fixing a single (arbitrary) value, which we believe would not be very meaningful. The results are then visualized as an ROC curve, in which each point in the (Recall, Precision)-plane corresponds to a specific choice of the parameter. Formally, we introduce binary variables {B_i(t)} for each mass i contained in the list of cardinality $\hat{L} (t)$ when setting the threshold equal to t, where B_i(t) equals 1 if the mass is matched and 0 otherwise, and denote by L the number of masses of the manual annotation. The true positive rate (recall, R) and the positive predictive value (precision, P) associated with threshold t are then defined by $R (t) = \frac{\sum_{i} B_{i} (t)}{L}$, $P (t) = \frac{\sum_{i} B_{i} (t)}{\hat{L} (t)}$. An ROC curve results from a sequence of pairs {R(t),P(t)} for varying t.
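The definitions of R(t) and P(t) translate directly into a short computation. The sketch below assumes that each candidate carries a mass and a score, that a candidate is matched if some annotated mass lies within the ppm tolerance, and that each annotated mass is matched by at most one candidate; all names and numbers are hypothetical.

```python
def is_matched(mass, annotation, tol_ppm):
    """B_i = 1 if some annotated mass lies within tol_ppm of the candidate."""
    return any(abs(mass - ref) / ref * 1e6 <= tol_ppm for ref in annotation)


def recall_precision(candidates, annotation, threshold, tol_ppm):
    """candidates: list of (mass, score) pairs; annotation: reference masses."""
    picked = [mass for mass, score in candidates if score >= threshold]
    b = [1 if is_matched(mass, annotation, tol_ppm) else 0 for mass in picked]
    recall = sum(b) / len(annotation)                              # R(t)
    precision = sum(b) / len(picked) if picked else float("nan")   # P(t)
    return recall, precision


annotation = [1000.0, 1500.0, 2000.0]
candidates = [(1000.02, 9.0), (1500.2, 4.0), (1750.0, 2.0), (2000.01, 0.5)]
# One (R, P) point per threshold traces out the curve.
curve = [recall_precision(candidates, annotation, t, tol_ppm=50.0)
         for t in (5.0, 3.0, 0.1)]
print(curve)
```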

Database query

The second part evaluates the lists in terms of a query to the Mascot search engine [37], version 2.2.04. In particular, we account for a major problem of a manual annotation, namely that peptides yielding weak MS signals might easily be overlooked, but might be detected by methods designed to extract those weak signals. Since we are especially interested in demonstrating the method’s ability to separate overlapping patterns, we adapted the standard search parameters of Mascot’s peptide mass fingerprint routine to allow two missed cleavage sites and to incorporate the following (variable) post-translational modifications: ‘Oxidation (M)’, ‘Carbamidomethyl (C)’, ‘Amidated (Protein C-term)’, ‘Deamidated (NQ)’. In particular, the latter two modifications will frequently trigger MS signals interleaving with the pattern of their unmodified counterpart: in the case of a deamidation the modified ion shows a mass of approx. 0.98 Da more compared to the amidated peptide. The same mass tolerances as for the manual annotation are used. As for the comparison with the manual annotation, we evaluate several lists corresponding to different choices of the threshold. Instead of an ROC curve, which turned out to be visually unpleasant, we display the statistics (score, coverage and fraction of hits) of two lists per method, namely of those achieving the best score and the best coverage, respectively. The complete set of results as well as further details of our evaluation like the manual annotation are contained in Additional file 3.

Competing methods

We compare our method in its two variants depending on the choice of the fitting criterion (cf. Eq. (7)), labelled l₁(q=1) and l₂(q=2), respectively, with the following competing methods.

Lasso

The ‘lasso’ method in this paper serves as a surrogate for NITPICK. Since the ‘lasso’ is embedded into our framework while implementing a methodology that closely resembles NITPICK, we use the ‘lasso’ for the sake of convenience, to avoid an involved parameter optimization for NITPICK. Our lasso implementation benefits from the improved merging procedure of Section “Postprocessing and thresholding”. To accommodate a heterogeneous noise level, NITPICK divides spectra into bins. This can be avoided by determining a minimizer $\hat{β} (λ; W)$ of the weighted non-negative lasso problem

\begin{align} min_{β \geq 0} ∥ y - Φ β ∥_{2}^{2} + λ 1^{⊤} W β, λ > 0, \end{align}

(12)

where W is a diagonal matrix with entries w_c,j=LNL₊(m_c,j), j=1,…,p_c, c=1,…,C, whose purpose is to re-scale the amount of ℓ₁-regularization according to the local noise level. The columns of the template matrix Φ in (8)). The parameter λ here plays the role of the threshold t, cf. Section “Validation strategy”.
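To make the role of the weights concrete, the weighted non-negative lasso objective above can be minimized on a toy problem by cyclic coordinate descent. This is a sketch under stated assumptions: coordinate descent is substituted here for illustration and is not the solver used in the paper, Φ is passed as a plain list of rows, and the weights w stand in for the local noise levels.

```python
def weighted_nn_lasso(Phi, y, w, lam, n_sweeps=200):
    """Minimize ||y - Phi b||_2^2 + lam * sum_j w[j] * b[j] subject to b >= 0."""
    n, p = len(Phi), len(Phi[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - Phi beta for the start value beta = 0
    col_sq = [sum(Phi[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Inner product of column j with the partial residual
            # (the contribution of column j itself is added back).
            rho = sum(Phi[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            # Closed-form coordinate update, clipped at zero.
            new = max(0.0, (rho - 0.5 * lam * w[j]) / col_sq[j])
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Phi[i][j] * delta
                beta[j] = new
    return beta


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [3.0, 0.5, -1.0]
print(weighted_nn_lasso(identity, y, w=[1.0, 1.0, 1.0], lam=2.0))
print(weighted_nn_lasso(identity, y, w=[1.0, 0.2, 1.0], lam=2.0))
```

In the second call the smaller weight on the second coordinate, mimicking a lower local noise level, lets a weak signal survive that the uniform penalty sets to zero.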

Pepex

As discussed in Section “Template fitting”, pepex performs centroiding and de-isotoping separately. De-isotoping is based on non-negative least squares. Since pepex is limited to detecting patterns of charge state one, its performance is only assessed for MALDI-TOF spectra. Accordingly, when comparing the output of pepex with the manual annotation, the few patterns of charge state two are excluded. The parameters nm, pft, mincd, maxcd and nsam were set to optimize performance with respect to manual annotation. The ROC curves are based on peaklists resulting from ten different choices of the signal-to-noise parameter snr.

Isotope wavelet

As opposed to our method, this approach is not able to handle overlaps. On the other hand, it typically shows strong performance in noisy and low intensity regions or on datasets with extremely low concentrations [39, 40]. While the isotope wavelet is not limited to charge one, it is run in charge-one-only mode for the MALDI-TOF spectra to achieve more competitive performance. For the sake of a fair comparison, the results of the isotope wavelet on the MALDI-TOF spectra are evaluated in the same way as those of pepex.

Vendor

The parameter setting for the ABI MALDI-TOF/TOF MS software was as follows: Local Noise Width (m/z) 250, Min Peak Width at FWHM 2.9. The Cluster Area Optimization S/N threshold has been dynamically adapted to about three times the S/N threshold as suggested by the ABI documentation. Since the vendor software is limited to charge one, its outputs are evaluated in the same way as those of pepex. Given the disproportionally high effort needed to find an optimal parameter setting of the vendor software for ESI spectra, its performance is not assessed.

Results

Manual annotation vs. database query

When inspecting Figures 4 and 5 on the one hand and Table 1 on the other hand, one notices that results of the evaluation based on the manual annotation are not in full accordance with the results of the database query. The difference is most striking for the MALDI-TOF spectra at 500 fmol, where our methods (l₁ and l₂) yield a significant improvement, which does not become apparent from the database query. This is because only a fraction of the manual annotation is actually confirmed by the database query. The part which is not matched likely consists of artifacts due to contamination or chemical noise as well as of specific modifications not captured by the database query. In light of this, our dual validation scheme indeed makes sense.

Table 1 Mascot results


Comparison

Figure 4 and Table 1 reveal an excellent performance of our methods (l₁ and l₂) throughout all MALDI-TOF spectra under consideration. For the myoglobin spectra, high sequence coverages are attained that clearly stand above those of competing methods. For the spectra at 10 fmol, only the performance of the lasso is competitive with that of our methods in terms of the Mascot score; all other competitors, including the vendor software which has been tailored to process these spectra, are significantly weaker. In particular, the strikingly high proportion of ‘hits’ (≥94%) indicates that even at moderate concentration levels, our methods still distinguish well between signal and noise. This observation is strongly supported by the ROC curves in Figure 4, where the precision drops comparatively slowly with increasing recall. In this regard, our methodology clearly contrasts with approaches like the isotope wavelet that aim at achieving high protein sequence coverage. The latter often requires the selection of peptide signals of extremely low abundance hidden in noise, at the expense of reduced specificity.

For MALDI-TOF spectra at high concentration levels, pepex achieves the best scores and is competitive with respect to sequence coverage. However, the performance of pepex degrades dramatically at lower concentration levels, as is unambiguously shown by both parts of the evaluation. In particular, the database scores are the worst among all methods compared. This provides some support for our reasoning at the end of Section “Template fitting”.

For the ESI spectra, our methods in total fall a bit short of the lasso (particularly for the ion trap spectra), but perform convincingly as well, thereby demonstrating that they can deal well with multiple charge states. This is an important finding, since the presence of multiple charges makes the sparse recovery problem as formulated in model (1) much more challenging, because the number of parameters to be estimated as well as the correlations across templates are increased. In spite of these difficulties, Figure 5 and Table 1 suggest that the performance of our pure fitting approach (7) does not appear to be affected. Using a more difficult set of spectra, the capability to process ESI data with impressive success is additionally shown in the next section.

Additional remarks

In Figure 4, the area under the curve (AUC) of our methods attained for myoglobin is higher for lower concentration. At first glance, this may seem contradictory since an increase in concentration should lead to a simplified problem. However, a direct comparison of the AUCs is problematic, since the number of true positives (17 at 10 fmol, 106 at 500 fmol) is rather different. For instance, there are choices of the threshold that yield 18 true positives and not a single false positive for both of our methods at 500 fmol, yet the AUC is lower.
The fact that some of the ROCs start in the lower left corner results from outputs containing only false positives.

Unmixing of overlaps

Motivation

One of the main advantages of our method over more simplistic pattern picking methods is the ability to disentangle isotope patterns of overlapping peptide signals, whose presence may lead to a significantly more challenging pattern picking problem, as discussed e.g. in [41] in the slightly different context of intact protein mass spectra. Therefore, a potential application for our approach will be the analysis of a certain class of posttranslational modifications, the deamidation of amino acid residues containing a carboxamide side chain functionality. The deamidation of asparagine (Asn) or glutamine (Gln) residues, yielding aspartic acid (Asp) or glutamic acid (Glu) residues, respectively, is an important posttranslational modification, which can have immense effects on the structure of peptides [42] and is of great relevance in a number of pathophysiological events [43]. During the deamidation, the side chain carboxamide is hydrolysed, which is accompanied by a mass increase of 0.98 Da. Thus, in a spectrum of a mixture of the amidated and deamidated form, a direct overlap of both signals can be observed. It has to be noted that, in addition to the amidated/deamidated forms, in the case of Asn deamidation a second product containing an iso-peptide bond is formed, which has the same molecular behaviour; these two forms can be identified solely by their differential MS/MS behavior.
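The size of the mass shift, and why it produces a direct overlap, can be checked with a short calculation. The atomic masses below are standard monoisotopic values rounded to six decimals and do not come from the text; treating the isotope spacing as the ¹³C/¹²C mass difference is a simplifying assumption.

```python
# Monoisotopic atomic masses in Da (standard values, rounded).
MASS_H = 1.007825
MASS_N = 14.003074
MASS_O = 15.994915
C13_C12_SPACING = 1.003355  # mass difference between 13C and 12C

# Hydrolysis of the carboxamide replaces -NH2 by -OH: net change -N -H +O.
deamidation_shift = MASS_O - MASS_N - MASS_H


def mz_offset(charge):
    """Offset in m/z between the monoisotopic peaks of the two forms."""
    return deamidation_shift / charge


def gap_to_first_isotope_peak(charge):
    """m/z distance between the deamidated monoisotopic peak and the first
    isotope peak of the unmodified peptide (assuming 13C spacing)."""
    return (C13_C12_SPACING - deamidation_shift) / charge


print(round(deamidation_shift, 6))             # 0.984016
print(round(gap_to_first_isotope_peak(1), 6))  # 0.019339
```

Under these assumptions the deamidated monoisotopic peak lies only about 0.02 Da (at charge one) from the first isotope peak of the unmodified peptide, so the two patterns interleave almost peak on peak.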

Results

The peptides analyzed here in order to assess the performance of our approach were synthesized by means of Fmoc-solid phase peptide synthesis; sequences corresponding to tryptic peptides from bovine serum albumin (BSA) with the sequences listed in Table 2 were selected.

Table 2 Peptides mixed together


In each measurement two out of the three listed peptides were mixed together in different ratios (Additional file 4). Given such a spectrum, we study the question whether our method returns the true underlying composition. We classify the output of our method as a correct interpretation of the spectrum if the templates corresponding to the true underlying peptides achieve signal-to-noise ratios of at least one and these ratios are the two largest among all templates used for fitting. This procedure corresponds to a selection-optimal choice of the threshold based on the knowledge of the true composition of the spectrum. This simplification may be justified in view of the extreme difficulty of the problem as illustrated in Figure 6, in particular in view of lowly resolved spectra with an average m/z-spacing of 0.06 Da. For the remaining parameters, we compare a grid search (performed separately for each spectrum) and the default parameter set (Section “…”). Table 3 and Figure 6 indicate that already the default parameter setting is able to solve successfully a wide range of problem instances. As one would expect, Table 3 and Figure 6 suggest that the higher the concentration and the more balanced the amplitudes of the overlapping peptides, the more likely it is that the overlap can be resolved. On the other hand, the higher the degree of overlap of the peptides, which depends on both their charges and the distance of their positions, the more difficult the problem is. This becomes obvious when considering the overlap of the two peptides located at 351.2 and 351.4 Da, respectively.
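The correctness criterion described above can be written as a small predicate. The sketch assumes that the fit yields one signal-to-noise ratio per template, stored in a dictionary; the template identifiers and values are hypothetical, and ties between templates are not treated specially.

```python
def is_correct_interpretation(snr, true_templates, min_snr=1.0):
    """snr: dict mapping template id -> signal-to-noise ratio after fitting.

    The output counts as correct if every true template reaches min_snr and
    the true templates have the largest ratios among all templates.
    """
    true_set = set(true_templates)
    if any(snr.get(k, 0.0) < min_snr for k in true_set):
        return False
    top = sorted(snr, key=snr.get, reverse=True)[:len(true_set)]
    return set(top) == true_set


truth = ["pep1_z2", "pep2_z2"]
print(is_correct_interpretation({"pep1_z2": 8.0, "pep2_z2": 3.5, "other": 0.7}, truth))
print(is_correct_interpretation({"pep1_z2": 8.0, "pep2_z2": 1.2, "other": 2.0}, truth))
```

The first call returns True; the second returns False because a spurious template outranks one of the true peptides.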

Table 3 Unmixing of overlaps


Conclusion

We have proposed a template matching approach for feature extraction in proteomic mass spectra. The main methodological innovation is a framework for sparse recovery in which sparsity is not promoted explicitly by a regularization term, as is usually done and as was done in previous work. We fully exploit the strength of non-negativity constraints, which permits us to circumvent the delicate choice of a ‘proper’ amount of regularization, an ever-lasting problem in statistics, and to work with thresholding instead. The latter is not only computationally attractive, because one does not have to repeatedly solve the same optimization problem for different choices of the regularization parameter, but also increases user-friendliness, since the threshold is directly related to the signal-to-noise ratio, the quantity domain experts are interested in. The replacement of a regularization parameter by a threshold is a cornerstone of our conceptual design, guided by the principle of relieving the user from laborious fine-tuning of parameters. We believe that a small set of well-interpretable parameters with suitable defaults additionally improves robustness and reproducibility of results. In this context, we would like to emphasize again that apart from the threshold, the user does not have to specify any parameters before running our software.

In a comprehensive experimental study involving instruments of varying resolution and spectra of varying concentration levels, where we comparatively assess the performance of our approach on the basis of an elaborate dual validation scheme, it is demonstrated that the performance for pattern picking is excellent for MALDI-TOF spectra and stands out due to its specificity in selecting signal and only little noise. A major strength of the method is its ability to unmix overlapping peptide signals as shown for a series of ESI spectra. In total, we demonstrate that our approach is broadly applicable to a variety of spectra. While our approach is guided by a concrete application in proteomics, the framework is general enough to be of use for related deconvolution problems emerging in other fields; only the templates have to be adjusted according to the specific application.

While we have focused on single spectra in this paper, the approach can be extended to process whole LC-MS runs, as has already been implemented in our R package IPPD. More precisely, the sweep line scheme of [44] is used to agglomerate the results from single spectra. To apply our methods on a routine basis, an improved implementation, notably parallelization, is required, since e.g. processing a single spectrum of the Maxquant datasets [2] takes 10 s on average on a Unix system equipped with an Intel(R) Core(TM)2 Duo CPU T9400 (2.53GHz) and 4 GB main memory. There is much room for improvement, since our implementation is based on interpreted R code.

Concerning future directions of research, a question we have not yet answered in a satisfactory way is the choice of the fitting criterion. While both criteria (least squares and least absolute deviation) employed in this paper perform well, their implicit assumption of additive noise might be questionable [45]. It is worth investigating whether a multiplicative noise model could yield even better results. Second, one might ask whether the performance could be further improved if our method is used jointly with the isotope wavelet, which is affected by overlaps, but has the potential to achieve higher protein sequence coverage.

Endnotes

^aMonoisotopic peptide mass centers are modelled by: 1.000485·m_n + 0.029, where m_n denotes the nominal mass.

^bFor the MALDI-TOF lysozyme datasets an extended search tolerance of 100 ppm was applied due to experimental miscalibration of the MS.

Appendix

Fitting with non-negativity constraints

In the following, we provide the details concerning optimization problem (7). In view of the special structure of Φ, (7) is computationally tractable even if n and the number of templates are in the ten thousands. We exploit the sparsity of the problem arising from templates which are highly localized, i.e. the domain on which they are numerically different from zero covers only a small part of the whole m/z range of the spectrum. As a consequence, both Φ and the Gram matrix Φ^⊤Φ, which is crucial in the computation, can conveniently be handled by using software for sparse matrices. For R, such software is available in the Matrix package [46].

Non-negative least squares

Consider the quadratic program

\begin{align} min_{β} \frac{1}{2} ∥ y - Φ β ∥_{2}^{2} \end{align}

(13)

\begin{align} subject to β \geq 0 . \end{align}

(14)

In order to solve (13), (14), we use the so-called log-barrier method, which amounts to solving a sequence of unconstrained nonlinear convex problems in which the constraints I(β_j≥0), j=1,…,p, are taken into account by incorporating log-barrier terms −log(β_j)/γ in the objective. As γ→∞, the log-barrier acts like a function which equals +∞ if β_j<0 and zero otherwise. Beginning with a moderately sized starting value for γ, we solve the convex problem

min_{β} \frac{1}{2} ∥ y - Φ β ∥_{2}^{2} - \frac{1}{γ} \sum_{j = 1}^{p} log (β_{j})

(15)

using Newton’s method. The gradient and Hessian with respect to β, respectively, are given by

\begin{align} \nabla_{β} & = - Φ^{⊤} (y - Φ β) - \frac{1}{γ} {[1 / β_{1} \dots 1 / β_{p}]}^{⊤} . \\ \nabla_{β}^{2} & = Φ^{⊤} Φ + \frac{1}{γ} diag (1 / β_{1}^{2}, \dots, 1 / β_{p}^{2}) . \end{align}

(16)

The Newton descent direction d_β is obtained from the linear system

\nabla_{β}^{2} d_{β} = - \nabla_{β} .

(17)

Solution of linear systems of this structure constitutes the main computational effort to be made. Fast solutions are obtained by using CHOLMOD [47], which offers an efficient implementation for computing the Cholesky factorization of sparse symmetric, positive definite matrices. Since the diagonal of $\nabla_{β}^{2}$ changes from one Newton iteration to the next, one Cholesky factorization has to be performed per Newton step. Once we have solved (15) for a specific γ, we solve a new problem of the type (15) for γ·M, M>1. This is repeated until γ exceeds a predefined maximum value. For a thorough account on the log-barrier method, we refer to [20].
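The scheme in (15) to (17) can be sketched with dense linear algebra as follows. This is an illustration under several assumptions and not the IPPD implementation: NumPy's dense solver replaces the sparse Cholesky factorization of CHOLMOD, a simple backtracking line search (not described in the text) keeps the iterates strictly positive, and the start value, the multiplier M (`mult`) and the stopping constants are arbitrary choices.

```python
import numpy as np


def nnls_log_barrier(Phi, y, gamma0=1.0, mult=10.0, gamma_max=1e8,
                     newton_steps=50, tol=1e-10):
    """Approximate argmin_{beta >= 0} 0.5 * ||y - Phi beta||_2^2 (dense sketch)."""
    p = Phi.shape[1]
    G = Phi.T @ Phi   # Gram matrix (sparse and banded in the paper's setting)
    c = Phi.T @ y
    yy = float(y @ y)

    def objective(beta, gamma):
        quad = 0.5 * (beta @ G @ beta) - c @ beta + 0.5 * yy
        return quad - np.sum(np.log(beta)) / gamma

    beta = np.ones(p)  # strictly feasible starting point
    gamma = gamma0
    while gamma <= gamma_max:
        for _ in range(newton_steps):
            grad = G @ beta - c - 1.0 / (gamma * beta)      # gradient in (16)
            hess = G + np.diag(1.0 / (gamma * beta ** 2))   # Hessian in (16)
            d = np.linalg.solve(hess, -grad)                # Newton system (17)
            decrement = -grad @ d
            if decrement < tol:
                break
            # Backtracking: stay strictly positive and decrease the objective.
            step, f0 = 1.0, objective(beta, gamma)
            while step > 1e-12:
                cand = beta + step * d
                if np.all(cand > 0) and \
                        objective(cand, gamma) <= f0 - 0.25 * step * decrement:
                    break
                step *= 0.5
            else:
                break  # no acceptable step found; move on to the next gamma
            beta = cand
        gamma *= mult  # tighten the barrier: gamma -> gamma * M
    return beta


Phi = np.eye(2)
y = np.array([2.0, -1.0])
print(nnls_log_barrier(Phi, y))  # close to [2, 0]
```

Because of the barrier, the returned coefficients are strictly positive; entries that are zero in the exact solution come out as tiny positive numbers, which is one reason a subsequent thresholding step is natural.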

Complexity analysis of non-negative least squares

We here provide the order of magnitude of floating point operations (flops) required per update (i.e. per Newton step) for the specific non-negative least squares problems considered in this paper. In our implementation, we exploit that the templates contained in the matrix Φ are highly localized. As a result, after a suitable column permutation, the matrix Φ^⊤Φ is roughly a band matrix with bandwidth k no larger than a few hundred. The dominant operation is solving the linear system $\nabla_{β}^{2} d_{β} = - \nabla_{β}$ with the help of the Cholesky factorization, which can be done in O(p k²) flops (e.g. [20], p.670). Our algorithm usually terminates after no more than one hundred Newton steps.

Non-negative least absolute deviation

Consider the optimization problem

\begin{align} min_{β} ∥ y - Φ β ∥_{1} \end{align}

(18)

\begin{align} subject to β \geq 0 . \end{align}

(19)

Problem (18), (19) can be recast as the following linear program.

\begin{align} min_{r} r^{⊤} 1 \\ subject to \\ Φ β - y + r \geq 0, \\ y - Φ β + r \geq 0, \\ r \geq 0, \\ β \geq 0 . \end{align}

(20)

For its solution, we use the log-barrier method sketched in the previous paragraph. After incorporating log-barrier terms for all constraints, the objectives of the unconstrained convex problems are of the form

\begin{matrix} r^{⊤} 1 - \frac{1}{γ} (\sum_{i = 1}^{n} (log (ξ_{i}^{+}) + log (ξ_{i}^{-}) + log (r_{i})) + \sum_{j = 1}^{p} log (β_{j})), \end{matrix}

(21)

where we have used the notational shortcuts

\begin{array}{l} ξ_{i}^{+} = {(Φ β)}_{i} - y_{i} + r_{i}, \\ ξ_{i}^{-} = y_{i} - {(Φ β)}_{i} + r_{i}, i = 1, \dots, n. \end{array}

(22)

The gradients w.r.t. r and β, respectively, are given by

\begin{array}{l} \nabla_{r} = 1 - \frac{1}{γ} {[\frac{1}{ξ_{1}^{+}} + \frac{1}{ξ_{1}^{-}} + \frac{1}{r_{1}} \ \dots \ \frac{1}{ξ_{n}^{+}} + \frac{1}{ξ_{n}^{-}} + \frac{1}{r_{n}}]}^{⊤}, \\ \nabla_{β} = - \frac{1}{γ} (Φ^{⊤} ({[Ξ^{+}]}^{- 1} - {[Ξ^{-}]}^{- 1}) 1 + {[1 / β_{1} \dots 1 / β_{p}]}^{⊤}), \\ Ξ^{\pm} = diag (ξ_{1}^{\pm}, \dots, ξ_{n}^{\pm}) . \end{array}

(23)

Introducing R=diag(r₁,…,r_n) and B=diag(β₁,…,β_p), the Hessian is given by the block matrix

[\begin{array}{cc} \nabla_{r}^{2} & \nabla_{r β} \\ \nabla_{r β}^{⊤} & \nabla_{β}^{2} \end{array}] = [\begin{array}{cc} \frac{1}{γ} ({[Ξ^{+}]}^{- 2} + {[Ξ^{-}]}^{- 2} + R^{- 2}) & \frac{1}{γ} ({[Ξ^{+}]}^{- 2} Φ - {[Ξ^{-}]}^{- 2} Φ) \\ \frac{1}{γ} (Φ^{⊤} {[Ξ^{+}]}^{- 2} - Φ^{⊤} {[Ξ^{-}]}^{- 2}) & \frac{1}{γ} (Φ^{⊤} ({[Ξ^{+}]}^{- 2} + {[Ξ^{-}]}^{- 2}) Φ + B^{- 2}) \end{array}] .

(24)

The linear system for the Newton descent directions reads

[\begin{matrix} \nabla_{r}^{2} & \nabla_{r β} \\ \nabla_{r β}^{⊤} & \nabla_{β}^{2} \end{matrix}] [\begin{matrix} d_{r} \\ d_{β} \end{matrix}] = - [\begin{array}{l} \nabla_{r} \\ \nabla_{β} \end{array}] .

(25)

Note that $\nabla_{r}^{2}$ is diagonal, so it is a cheap operation to solve for d_r once d_β is known:

d_{r} = - {(\nabla_{r}^{2})}^{- 1} (\nabla_{r β} d_{β} + \nabla_{r}) .

(26)

Plugging this into the second block of the linear system, one obtains

- \nabla_{r β}^{⊤} {(\nabla_{r}^{2})}^{- 1} (\nabla_{r β} d_{β} + \nabla_{r}) + \nabla_{β}^{2} d_{β} = - \nabla_{β}

(27)

which is equivalent to

(\nabla_{β}^{2} - \nabla_{r β}^{⊤} {(\nabla_{r}^{2})}^{- 1} \nabla_{r β}) d_{β} = - \nabla_{β} + \nabla_{r β}^{⊤} {(\nabla_{r}^{2})}^{- 1} \nabla_{r} .

(28)
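The block elimination in (26) to (28) can be verified numerically on a small random instance. In the sketch below the argument names and test matrices are hypothetical; the only structural assumptions are the ones used in the derivation, namely that the upper-left block is diagonal and the full matrix is symmetric.

```python
import numpy as np


def solve_by_block_elimination(h_rr_diag, h_rb, h_bb, g_r, g_b):
    """Solve [[H_rr, H_rb], [H_rb^T, H_bb]] [d_r; d_b] = -[g_r; g_b],
    where H_rr is diagonal and given by its diagonal h_rr_diag."""
    inv_rr = 1.0 / h_rr_diag                           # inverse of diagonal block
    schur = h_bb - h_rb.T @ (inv_rr[:, None] * h_rb)   # left-hand matrix in (28)
    rhs = -g_b + h_rb.T @ (inv_rr * g_r)               # right-hand side of (28)
    d_b = np.linalg.solve(schur, rhs)                  # reduced p x p system
    d_r = -inv_rr * (h_rb @ d_b + g_r)                 # back-substitution (26)
    return d_r, d_b


rng = np.random.default_rng(0)
n, p = 5, 3
h_rr_diag = rng.uniform(1.0, 2.0, size=n)
h_rb = 0.1 * rng.normal(size=(n, p))
h_bb = 3.0 * np.eye(p)
g_r, g_b = rng.normal(size=n), rng.normal(size=p)

d_r, d_b = solve_by_block_elimination(h_rr_diag, h_rb, h_bb, g_r, g_b)
full = np.block([[np.diag(h_rr_diag), h_rb], [h_rb.T, h_bb]])
reference = np.linalg.solve(full, -np.concatenate([g_r, g_b]))
print(np.allclose(np.concatenate([d_r, d_b]), reference))
```

Only the reduced system of size p has to be factorized; the n components of d_r follow from elementwise operations.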

In order to solve the linear system, we proceed as for non-negative least squares. The computational cost of this operation is roughly the same, since the sparse structure of Φ^⊤Φ can still be exploited. For non-negative least squares, re-computation of the Hessian $\nabla_{β}^{2}$ only involves a diagonal update, an operation of negligible computational cost. However, for non-negative least absolute deviation, computation of $\nabla_{β}^{2}$ involves the matrix multiplication Φ^⊤([Ξ⁺]⁻² + [Ξ⁻]⁻²)Φ, i.e. essentially a repeated computation of a scaled Gram matrix. In spite of the special structure of Φ^⊤Φ, the computational cost is of the same order as the solution of the linear system even when using a self-written routine for matrix multiplication tailored to the specific structure.

References

Mo F, Mo Q, Chen Y, Goodlett DR, Hood L, Omenn GS, Li S, Lin B: WaveletQuant, an improved quantification software based on wavelet signal threshold de-noising for labeled quantitative proteomic analysis. BMC Bioinformatics 2010, 11: 219. 10.1186/1471-2105-11-219
Cox J, Mann M: MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol 2008, 26: 1367–1372. 10.1038/nbt.1511
Renard B, Kirchner M, Steen H, Steen J, Hamprecht F: NITPICK: peak identification for mass spectrometry data. BMC Bioinformatics 2008, 9: 355. 10.1186/1471-2105-9-355
Hoopmann MR, Finney GL, MacCoss MJ: High-speed data reduction, feature detection, and MS/MS spectrum quality assessment of shotgun proteomics data sets using high-resolution mass spectrometry. Anal Chem 2007, 79: 5620–5632. 10.1021/ac0700833
Gambin A, Dutkowski J, Karczmarski J, Kluge B, Kowalczyk K, Ostrowski J, Poznanski J, Tiuryn J, Bakun M, Dadlez M: Automated reduction and interpretation of multidimensional mass spectra for analysis of complex peptide mixtures. Int J Mass Spectrom 2007, 260: 20–30. 10.1016/j.ijms.2006.06.011
Mantini D, Petrucci F, Pieragostino D, Del Boccio P, Di Nicola M, Di Ilio C, Federici G, Sacchetta P, Comani S, Urbani A: LIMPIC: a computational method for the separation of protein MALDI-TOF-MS signals from noise. BMC Bioinformatics 2007, 8: 101. 10.1186/1471-2105-8-101
Noy K, Fasulo D: Improved model-based, platform-independent feature extraction for mass spectrometry. Bioinformatics 2007, 23: 2528–2535. 10.1093/bioinformatics/btm385
Kaur P, O’Connor PB: Algorithms for automatic interpretation of high resolution mass spectra. J Am Soc Mass Spectrom 2006, 17: 459–468. 10.1016/j.jasms.2005.11.024
Tibshirani R: Regression shrinkage and variable selection via the lasso. J R Stat Soc Ser B 1996, 58: 671–686.
Du P, Angeletti R: Automatic deconvolution of isotope resolved mass spectra using variable Selection and quantized peptide mass distribution. Anal Chem 2006, 78: 3385–3392. 10.1021/ac052212q
Tibshirani R: Regression shrinkage and selection via the lasso: a retrospective (with discussion). J R Stat Soc Ser B 2011, 73: 273–282. 10.1111/j.1467-9868.2011.00771.x
Slawski M, Hein M: Sparse recovery by thresholded non-negative least squares. In Advances in Neural Information Processing Systems 24. MIT press, Cambridge, Massachusetts; 2011:1926–1934.
Lange E, Gropl C, Reinert K, Kohlbacher O, Hildebrandt A: High-accuracy peak picking of proteomics data using wavelet techniques. Pac Symp Biocomput 2006, 11: 243–254.
Schulz-Trieglaff O, Hussong R, Gröpl C, Hildebrandt A, Reinert K: A fast and accurate algorithm for the quantification of peptides from mass spectrometry data. In Proceedings of the Eleventh Annual International Conference on Research in Computational Molecular Biology (RECOMB 2007), Volume 11. Springer, Berlin; 2007:437–487.
Google Scholar
Zubarev R: Accurate monoisotopic mass measurements of peptides: possibilities and limitations of high resolution time-of-flight particle desorption mass spectrometry. Rapid Commun Mass Spectrom 1996, 10(11):1386–1392. 10.1002/(SICI)1097-0231(199608)10:11<1386::AID-RCM652>3.0.CO;2-T
Article CAS Google Scholar
Senko M, Beu S, McLafferty F: Determination of monoisotopic masses and ion populations for large biomolecules from resolved isotopic distributions. J Am Soc Mass Spectrom 1995, 6: 229–233. 10.1016/1044-0305(95)00017-8
Article CAS PubMed Google Scholar
Horn DM, Zubarev RA, McLafferty FW: Automated reduction and interpretation of high resolution electrospray mass spectra of large molecules. J Am Soc Mass Spectrom 2000, 11: 320–332. 10.1016/S1044-0305(99)00157-9
Article CAS PubMed Google Scholar
Lou X, Renard B, Kirchner M, Koethe U, Graf C, Lee C, Steen J, Steen H, Mayer M, Hamprecht F: Deuteration distribution estimation with improved sequence coverage for HDX/MS experiments. Bioinformatics 2010, 26: 1535–1541. 10.1093/bioinformatics/btq165
Article CAS PubMed Google Scholar
Suits F, Hoekman B, Rosenling T, Bischoff R, Horvatovich P: Threshold-avoiding proteomics pipeline. Anal Chem 2011, 83: 7786–7794. 10.1021/ac201332j
Article CAS PubMed Google Scholar
Boyd S, Vandenberghe L: Convex Optimization. Cambridge University Press, New York; 2004.
Book Google Scholar
Samuelsson J, Dalevi D, Levander F, Rognvaldsson T: Modular, scriptable and automated analysis tools for high-throughput peptide mass fingerprinting. Bioinformatics 2004, 20: 3628–3635. 10.1093/bioinformatics/bth460
Article CAS PubMed Google Scholar
Efron B, Hastie T, Johnstone I, Tibshirani R: Least Angle Regression (with discussion). Ann Stat 2004, 32: 407–499. 10.1214/009053604000000067
Article Google Scholar
van de Geer S, Bühlmann P: On the conditions used to prove oracle results for the Lasso. Electron J Stat 2009, 3: 1360–1392. 10.1214/09-EJS506
Article Google Scholar
Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning, 2nd Edition. Springer, New York; 2008.
Google Scholar
Bruckstein A, Elad M, Zibulevsky M: On the uniqueness of nonnegative sparse solutions to underdetermined systems of equations. IEEE Trans Inf Theory 2008, 54: 4813–4820.
Article Google Scholar
Wang M, Tang A: Conditions for a unique non-negative solution to an underdetermined system. In Proceedings of Allerton Conference on Communication, Control, and Computing, Volume 49,. IEEE Press, Piscataway, New Jersey; 2009:301–307.
Google Scholar
Donoho D, Tanner J: Counting the faces of randomly-projected hypercubes and orthants, with applications. Discrete Comput Geometry 2010, 43: 522–541. 10.1007/s00454-009-9221-z
Article Google Scholar
Meinshausen N: Sign-constrained least squares estimation for high-dimensional regression. Tech. rep.. Department of Statistics, Oxford University; 2012.
Google Scholar
Meinshausen N, Yu B: Lasso-type recovery of sparse representations for high-dimensional data. Ann Stat 2009, 37: 246–270. 10.1214/07-AOS582
Article Google Scholar
Zhou S: Thresholding Procedures for high dimensional variable selection and statistical estimation. In Advances in Neural Information Processing Systems 22. MIT press, Cambridge, Massachusetts; 2009:2304–2312.
Google Scholar
Zhang T: Some sharp performance bounds for least squares regression with L1 regularization. Ann Stat 2009, 37: 2109–2144. 10.1214/08-AOS659
Article Google Scholar
Donoho D, Johnstone I: Ideal spatial adaption by Wavelet shrinkage. Biometrika 1994, 81: 425–455. 10.1093/biomet/81.3.425
Article Google Scholar
Fan J, Li R: Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Association 2001, 97: 210–221.
Google Scholar
Wasserman L, Roeder K: High-dimensional variable selection. Ann Stat 2009, 37: 2178–2201. 10.1214/08-AOS646
Article PubMed Central PubMed Google Scholar
Fan J, Guo S, Hao N: Variance estimation using refitted cross-validation in ultrahigh dimensional regression. J R Stat Soc Ser B 2012, 74: 37–65. 10.1111/j.1467-9868.2011.01005.x
Article Google Scholar
Wolski WE, Farrow M, Emde AK, Lehrach H, Lalowski M, Reinert K: Analytical model of peptide mass cluster centres with applications. Proteome Sci 2006, 4: 18. 10.1186/1477-5956-4-18
Article PubMed Central PubMed Google Scholar
Perkins DN, Pappin DJ, Creasy DM, Cottrell JS: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20: 3551–3567. 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2
Article CAS PubMed Google Scholar
Friedman J, Hastie T, Tibshirani R: Regularized paths for generalized linear models via coordinate descent. J Stat Software 2010, 33: 1–22.
Article Google Scholar
Hussong R, Tholey A, Hildebrandt A: Efficient Analysis of Mass Spectrometry Data Using the Isotope Wavelet. In COMPLIFE 2007: The Third International Symposium on Computational Life Science, Volume 940(1),. Edited by: Siebes APJM, Berthold MR, Glen RC, Feelders AJ. AIP, Melville; 2007:139–149.
Google Scholar
Hussong R, Gregorius B, Tholey A, Hildebrandt A: Highly accelerated feature detection in proteomics data sets using modern graphics processing units. Bioinformatics 2009, 25: 1937–1943. 10.1093/bioinformatics/btp294
Article CAS PubMed Google Scholar
Liu X, Inbar Y, Dorrestein P, Wyne C, Edwards N, Souda P, Whitelegge J, Bafna V, Pevzner P: Decovolution and database search of complex tandem mass spectra of intact proteins. Mol Cell Proteomics 2010, 9: 2772–2782. 10.1074/mcp.M110.002766
Article PubMed Central CAS PubMed Google Scholar
Tholey A, Pipkorn R, Bossemeyer D, Kinzel V, Reed J: Influence of myristoylation, phosphorylation, and deamidation on the structural behavior of the N-Terminus of the Catalytic subunit of CAMP-Dependent protein kinase. Biochemistry 2001, 40: 225–231. 10.1021/bi0021277
Article CAS PubMed Google Scholar
Reissner K, Aswad D: Deamidation and isoaspartate formation in proteins: unwanted alterations or surreptitious signals? Cell Mol Life Sci 2003, 60: 1281–1295. 10.1007/s00018-003-2287-5
Article CAS PubMed Google Scholar
Schulz-Trieglaff O, Hussong R, Gröpl C, Leinenbach A, Hildebrandt A, Huber C, Reinert K: Computational quantification of peptides from LC-MS data. J Comput Biol 2008, 15: 685–704. 10.1089/cmb.2007.0117
Article CAS PubMed Google Scholar
Du P, Stolovitzky G, Horvatovich P, Bischoff R, Lim J, Suits F: A noise model for mass spectrometry based proteomics. Bioinformatics 2008, 24: 1070–1077. 10.1093/bioinformatics/btn078
Article CAS PubMed Google Scholar
Bates D, Maechler M: Matrix: Sparse and Dense Matrix Classes and Methods. 2009. [R package version 0.999375–21] [R package version 0.999375-21]
Google Scholar
Davis T: CHOLMOD: sparse supernodal Cholesky factorization and update/downdate. 2005.
Google Scholar

Download references

Acknowledgements

The authors would like to thank Markus Martin for setting up the Bruker Daltonics HCT Ultra Ion Trap MS and Bart van den Berg for measuring the LC-MS datasets used in the vignette of the R package IPPD. We thank Fredrik Levander and Thorsteinn Rognvaldsson for providing us with the pepex implementation. We also thank the reviewers and an associate editor, whose comments and suggestions led to a substantial improvement over previous drafts.

Funding

This work was supported by the Clusters of Excellence ‘Multimodal Computing and Interaction’ (to M.S., R.H. and B.G.) and ‘Inflammation@Interfaces’ (to A.T. and T.J.) within the Excellence Initiative of the German Federal Government, and by the DFG (grants BIZ4:1-4 to R.H. and A.H.).

Author information

Authors and Affiliations

Department of Computer Science, Saarland University, Saarbrücken, Germany
Martin Slawski & Matthias Hein
Center for Bioinformatics, Saarland University, Saarbrücken, Germany
Rene Hussong, Barbara Gregorius & Andreas Hildebrandt
Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
Rene Hussong
Division for Systematic Proteome Research, Institute for Experimental Medicine, Kiel, Germany
Andreas Tholey, Thomas Jakoby & Barbara Gregorius
Institut für Informatik, Johannes-Gutenberg-Universität, Mainz, Germany
Andreas Hildebrandt

Authors

Martin Slawski
Rene Hussong
Andreas Tholey
Thomas Jakoby
Barbara Gregorius
Andreas Hildebrandt
Matthias Hein

Corresponding author

Correspondence to Martin Slawski.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

MS and MH devised the methodology as presented in Section “Methods”. MS implemented the Bioconductor package, with contributions by RH and MH. The comparative data analysis was performed by RH, MS, MH and AH; RH and AH performed the MASCOT queries. AT developed the experimental design and provided an interpretation of the MS data. TJ and BG conducted the MS experiments and produced the results of the vendor software. All authors read and approved the final manuscript.

Electronic supplementary material

Additional file 1: MALDI-TOF spectra. (ZIP 3 MB)

Additional file 2: ESI spectra. (ZIP 444 KB)

Additional file 3: Evaluation and results. (ZIP 4 MB)

Additional file 4: Overlapping peptide signals. (ZIP 156 KB)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Slawski, M., Hussong, R., Tholey, A. et al. Isotope pattern deconvolution for peptide mass spectrometry by non-negative least squares/least absolute deviation template matching. BMC Bioinformatics 13, 291 (2012). https://doi.org/10.1186/1471-2105-13-291

Received: 06 March 2012
Accepted: 27 October 2012
Published: 08 November 2012
DOI: https://doi.org/10.1186/1471-2105-13-291
