Optimal precursor ion selection for LC-MALDI MS/MS
© Zerck et al.; licensee BioMed Central Ltd. 2013
Received: 20 July 2012
Accepted: 23 January 2013
Published: 18 February 2013
Liquid chromatography mass spectrometry (LC-MS) maps in shotgun proteomics are often too complex to select every detected peptide signal for fragmentation by tandem mass spectrometry (MS/MS). Standard methods for precursor ion selection, commonly based on data dependent acquisition, select highly abundant peptide signals in each spectrum. However, these approaches produce redundant information and are biased towards high-abundance proteins.
We present two algorithms for inclusion list creation that formulate precursor ion selection as an optimization problem. Given an LC-MS map, the first approach maximizes the number of selected precursors given constraints such as a limited number of acquisitions per RT fraction. Second, we introduce a protein sequence-based inclusion list that can be used to monitor proteins of interest. Given only the protein sequences, we create an inclusion list that optimally covers the whole protein set. Additionally, we propose an iterative precursor ion selection that aims at reducing the redundancy obtained with data dependent LC-MS/MS. We overcome the risk of erroneous assignments by including methods for retention time and proteotypicity predictions. We show that our method identifies a set of proteins requiring fewer precursors than standard approaches. Thus, it is well suited for precursor ion selection in experiments with limited sample amount or analysis time.
We present three approaches to precursor ion selection with LC-MALDI MS/MS. Using a well-defined protein standard and a complex human cell lysate, we demonstrate that our methods outperform standard approaches. Our algorithms are implemented as part of OpenMS and are available under http://www.openms.de.
LC-MS/MS-based proteomics is a key technique for protein quantitation and identification. A typical workflow starts with the proteolytic digestion of protein samples, usually with trypsin. The resulting peptide mixture is loaded onto a liquid chromatography (LC) column from which the peptides elute at different time points, called retention times (RT), according to their physicochemical properties (e.g. hydrophobicity and polarity). The LC system and the mass spectrometer are coupled either directly, as in electrospray ionization MS (ESI-MS), or indirectly via fractionation onto a target plate, as in MALDI-MS. The resulting peptide signals in the LC-MS map are referred to as features, while the selection of features for fragmentation with MS/MS is called precursor ion selection. Peptide identifications are assigned to MS/MS spectra using database search tools, such as Mascot or X!Tandem, or by de novo sequencing [3, 4]. The peptide sequences are then used to reconstruct the proteins that were present in the sample.
A problem for protein identification with tandem mass spectrometry is the limited number of possible MS/MS acquisitions. Even in simple protein digests there are more detected peptide signals than possible selections for MS/MS. The number of possible fragmentations is limited either by the elution time of the peptide (ESI) or by the amount of sample available for each fraction (MALDI). A standard method for precursor ion selection with ESI-MS/MS is data dependent acquisition (DDA), which selects the x most intense signals in each MS spectrum for fragmentation, with x depending on the instrument type. However, as biological samples have a high dynamic range of protein abundance, the peptide identifications are biased towards high-abundance proteins, although low-abundance proteins are often of greater interest.
In order to circumvent redundancy, DDA can be combined with a dynamic exclusion list (DEX) that prevents a signal at the same m/z value from being fragmented again within a specified RT range. Exclusion lists are often used for replicate analyses [6–11]: after each LC-MS/MS run the exclusion list is updated so that it contains the fragmented or identified signals of previous runs. Chen et al. showed that, in comparison to simple repetitions, the number of unique peptide identifications can be increased significantly. Bendall et al. reached a higher number of protein identifications.
A complementary strategy to exclusion lists is directed MS/MS. Instead of excluding potentially uninteresting signals, the selection focuses particularly on signals of interest. These signals are part of an inclusion list that contains the m/z values and usually an RT window for each peptide. This procedure is a typical approach for LC-MALDI MS/MS, where MS and MS/MS are decoupled. Thus, MS acquisition can be used to create a map of all detected signals, which guides the precursor ion selection. Moreover, inclusion lists have also been used in combination with ESI-MS/MS: a consensus map of detectable LC-MS features created from previous runs was used to create the inclusion list [12–15]. These studies showed that, compared with DDA, directed MS/MS might identify a higher number of peptides [14, 15]. This effect is more pronounced for low intensity peptides.
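The interplay of top-x selection and dynamic exclusion can be illustrated with a minimal sketch. It is not taken from any instrument software: the function name, the representation of a spectrum as an m/z-to-intensity dictionary, and the absolute m/z tolerance are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def dda_with_exclusion(spectra: List[Dict[float, float]], top_x: int,
                       exclusion_scans: int,
                       mz_tol: float = 0.01) -> List[Tuple[int, float]]:
    """Select the top_x most intense signals per spectrum, skipping m/z
    values that were selected within the previous `exclusion_scans` spectra.

    spectra: one {m/z: intensity} dictionary per spectrum (hypothetical format).
    mz_tol: absolute m/z tolerance for the exclusion match (assumed value).
    """
    selected: List[Tuple[int, float]] = []
    recent: List[Tuple[int, float]] = []  # (scan index, m/z) still excluded
    for scan, peaks in enumerate(spectra):
        # Drop exclusions that are older than the exclusion range.
        recent = [(s, mz) for s, mz in recent if scan - s <= exclusion_scans]
        ranked = sorted(peaks.items(), key=lambda kv: kv[1], reverse=True)
        picked: List[Tuple[int, float]] = []
        for mz, _intensity in ranked:
            if len(picked) == top_x:
                break
            if any(abs(mz - ex_mz) <= mz_tol for _, ex_mz in recent):
                continue  # signal is currently on the exclusion list
            picked.append((scan, mz))
        selected.extend(picked)
        recent.extend(picked)
    return selected


toy_spectra = [{500.0: 100, 600.0: 50}, {500.0: 90, 600.0: 60}, {500.0: 80, 700.0: 10}]
plain_dda = dda_with_exclusion(toy_spectra, top_x=1, exclusion_scans=0)
with_dex = dda_with_exclusion(toy_spectra, top_x=1, exclusion_scans=1)
```

In this toy input, plain DDA fragments the signal at m/z 500 in every spectrum, whereas the variant with exclusion spends the second acquisition on the signal at m/z 600.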
In recent years, several studies have applied iterative approaches to precursor ion selection. For instance, Scherl et al. added theoretical m/z values of tryptic peptides of already identified proteins to an exclusion list. In a previous study, we showed the effect of combining both the directed analysis of interesting signals and the exclusion of uninteresting signals through a heuristic. In that study, a prioritized list of all possible precursors was reranked during ongoing MS/MS acquisition based on the identifications yielded so far. Precursors having an m/z value matching tryptic peptides of already identified proteins received a lower priority, whereas precursors matching tryptic peptides of uncertain protein candidates were assigned higher priorities. We demonstrated that this strategy can identify the same number of proteins as standard methods using fewer precursors. However, theoretical peptides were matched onto observed features using only the m/z value. Thus, the method showed a clear dependence on mass accuracy and sample complexity.
Liu et al. developed an iterative MS/MS acquisition (IMMA) approach that uses different filtering techniques to exclude uninteresting signals. Proteotypic peptides of already identified proteins are excluded, as are signals with a mass defect atypical for peptides. This way, a larger number of proteins could be identified than with DDA.
In this manuscript, we introduce a deterministic framework that formulates the precursor ion selection problem as an Integer Linear Program (ILP). We show that it can be easily adapted to variations of the original problem. We present three different scenarios and their corresponding optimization problems. We address the problem of erroneous peptide-precursor assignments by including predictions of RTs and proteotypic peptides in the matching. Furthermore, we employ a probabilistic scoring to infer proteins from peptide identifications. Our methods are implemented as part of the open-source library OpenMS and will be available as a TOPP tool as part of the next release of OpenMS.
Several precursor ion selection strategies are conceivable depending on the aim of a study and the available prior knowledge about the sample. Here, we focus on three settings: the first two use static inclusion lists created once before the MS/MS acquisition starts. The third changes the selection based on previous identifications. The two static inclusion list approaches differ in the information used during the selection process. In the following, when talking about peptides we refer to protein subsequences as opposed to precursors which denote MS/MS measurements.
Feature-based inclusion list: Given an LC-MS feature map, we want to maximize the number of scheduled precursors given some constraints on the number of simultaneous acquisitions per RT fraction. This is a common scenario with LC-MALDI due to its decoupled nature of LC and MS.
Protein-based inclusion list: Given a list of protein sequences but no prior LC-MS run, we want to find an optimal set of precursors that represents the proteins of interest best. As proteins are not identified directly we need to find a peptide set that optimally covers our specific proteins. For this peptide set, we predict the LC-MS features (i.e. retention time and m / z acquisition window) and add them to the inclusion list.
Iterative precursor ion selection: Given an LC-MALDI-MS feature map, we want to optimally exploit the set of possible precursors. Optimality in this case means that we want to identify the proteins in a sample using a minimal set of precursors, so that the remaining precursors can be used to discover other proteins. The precursor ion selection shall be adjusted during the ongoing MS/MS acquisition based on previous peptide and protein identifications. This way, we combine the discovery nature of DDA with directed MS/MS. As with both inclusion list formulations, the number of MS/MS acquisitions is limited by the number of precursors per RT fraction.
These settings can be formulated as optimization problems, which can be formalized as Integer Linear Programs (ILP). Solving the ILPs yields a list of precursor ions, the actual inclusion list. In the following sections we will introduce and explain the formulations.
For each feature j, we introduce a set of binary variables x j,s , which are set to 1 if we choose feature j in scan s as a precursor and 0 otherwise. Since we want to choose the best possible fraction for each precursor, we do not simply maximize the number of scheduled precursors but use the feature intensities as weights, because high intensity features are more likely to produce good and interpretable MS/MS spectra. The intensities are normalized by the maximal signal intensity the respective feature has in any spectrum. This results in weights between 0 and 1 and prevents a bias towards selecting only high intensity features.
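The normalization described above can be sketched in a few lines. The nested dictionary layout and the function name are hypothetical, and the sketch assumes that every feature has at least one strictly positive intensity.

```python
from typing import Dict, Hashable


def normalize_intensities(
        raw: Dict[Hashable, Dict[int, float]]) -> Dict[Hashable, Dict[int, float]]:
    """Divide each feature's intensities by that feature's maximal intensity.

    raw: {feature id: {spectrum index: raw intensity}} (hypothetical layout).
    Returns weights in (0, 1]; the spectrum in which a feature is most
    intense always receives weight 1.
    """
    weights = {}
    for feature, by_spectrum in raw.items():
        peak = max(by_spectrum.values())  # assumed to be > 0
        weights[feature] = {s: v / peak for s, v in by_spectrum.items()}
    return weights


toy_weights = normalize_intensities({"a": {0: 200.0, 1: 400.0}, "b": {1: 5.0}})
```

Because each feature is scaled by its own maximum, the weak feature "b" receives the same top weight as the strong feature "a" in their respective best spectra, which is what removes the bias towards high intensity features.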
Variables and constants used in LP formulations
x j,s: indicator variable, 1 if feature j is selected in spectrum s, 0 otherwise
Indicator variable, 1 if feature j is part of the solution, 0 otherwise
int j,s: normalized signal intensity of feature j in spectrum s
cap s: maximal number of MS/MS precursors in spectrum s
D i: detectability of protein i
-log(1 - D i): transformed detectability of protein i; higher values reflect a better protein detectability
Detectability of peptide k
Indicator variable, 1 if peptide k is part of protein i, 0 otherwise
ws: RT window size
Maximal number of elements in the inclusion list
Probability that peptide k was identified correctly
P i: probability that protein i was identified correctly
k 1, k 2, k 3: weights of the three parts of the objective function
Indicator variable, 1 if the protein probability of protein i is at least c, 0 otherwise
c: minimal protein probability to declare a protein identified
z i: z i = 1 if P i ≥ c, otherwise z i ∈ [0, 1)
Set of features having an m/z within a specified ppm range around the theoretical m/z of peptide k
Matching probability of feature j with peptide k
Number of already fragmented precursors
Number of selected precursors in each iteration
Here, x j,s is an indicator variable that is 1 if feature j is selected in spectrum s and 0 otherwise, int j,s is the normalized intensity of feature j in spectrum s, and cap s is the “capacity” of spectrum s, i.e., the maximal possible number of precursors for that spectrum. The problem of finding an optimal inclusion list is an instance of a well-known combinatorial problem, the Knapsack problem. We will show that solving our ILP yields a globally optimal inclusion list.
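From the definitions just given, the feature-based ILP can be written down as follows. This is a reconstruction from the surrounding text, not a copy of the original equations, and it does not attempt to reproduce the original equation numbering.

```latex
\begin{align*}
\max \quad & \sum_{j} \sum_{s} x_{j,s} \cdot \mathrm{int}_{j,s} \\
\text{subject to} \quad & \sum_{j} x_{j,s} \le \mathrm{cap}_{s} && \forall s \\
& \sum_{s} x_{j,s} \le 1 && \forall j \\
& x_{j,s} \in \{0, 1\} && \forall j, s
\end{align*}
```

The first constraint enforces the capacity of each spectrum, and the second one ensures that each feature is scheduled in at most one spectrum.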
In our implementation, we solve the ILP formulation using the GNU Linear Programming Kit (GLPK, http://www.gnu.org/software/glpk/). The solution provides values for all x j,s and all features j where x j,s = 1 are part of the final inclusion list. Due to Constraint (3), x j,s can only be 1 for at most one s for each precursor j. Thus, each precursor is scheduled in a specified fraction.
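As a toy illustration of what such a solver computes, the following sketch finds the optimum by exhaustive enumeration instead of calling GLPK. It is only feasible for a handful of features; the function name and the data layout are hypothetical.

```python
from itertools import product
from typing import Dict, Tuple


def solve_inclusion_list(weights: Dict[str, Dict[int, float]],
                         capacity: Dict[int, int]) -> Tuple[float, Dict[str, int]]:
    """Exhaustively search the best assignment of features to spectra.

    weights: {feature: {spectrum: normalized intensity}}.
    capacity: {spectrum: maximal number of precursors}.
    Each feature is placed in at most one spectrum, and no spectrum
    receives more precursors than its capacity.
    """
    features = sorted(weights)
    # A feature is either left out (None) or put into a spectrum where it occurs.
    options = [[None] + sorted(weights[f]) for f in features]
    best_value, best_plan = -1.0, {}
    for choice in product(*options):
        load: Dict[int, int] = {}
        for s in choice:
            if s is not None:
                load[s] = load.get(s, 0) + 1
        if any(n > capacity.get(s, 0) for s, n in load.items()):
            continue  # capacity constraint violated
        value = sum(weights[f][s] for f, s in zip(features, choice) if s is not None)
        if value > best_value:
            best_value = value
            best_plan = {f: s for f, s in zip(features, choice) if s is not None}
    return best_value, best_plan


toy = {"a": {0: 1.0, 1: 0.6}, "b": {0: 0.9, 1: 1.0}, "c": {0: 0.8}}
value, plan = solve_inclusion_list(toy, {0: 1, 1: 1})
```

With one precursor allowed per spectrum, the optimum schedules feature "a" in spectrum 0 and feature "b" in spectrum 1, and feature "c" is left out.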
This problem can be formulated as an optimization problem as well. Again, we have the spectrum capacity and the number of times a feature can be selected as constraints. Additionally, we want to achieve a certain likelihood for each protein to be identified with the selected precursors. In the following, we will refer to this as the protein detectability, in analogy to peptide detectability which is the likelihood to detect and identify a peptide in a given experimental setup. We develop a formula to compute the protein detectability via the protein probability calculation in the next section. This finally leads to the formulation of the protein sequence-based LP.
which is invalid for P j = 1, so in this case we enter a pseudo count instead.
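The role of the logarithm can be sketched as follows. This is a reconstruction under an assumption, not the original equation: it supposes that the protein probability combines the peptide probabilities as independent evidence, and the index set K_i (peptides of protein i) is a hypothetical notation.

```latex
\begin{align*}
P_i &= 1 - \prod_{j \in K_i} (1 - P_j) \\
-\log(1 - P_i) &= \sum_{j \in K_i} -\log(1 - P_j)
\end{align*}
```

Taking the logarithm turns the product into a sum, which is what allows the expression to appear in a linear program. Each summand is undefined for P_j = 1, which is the case handled by the pseudo count.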
In shotgun LC-MS/MS experiments typically not all tryptic peptides of a protein are observed. Instead, there is a characteristic set of peptides that can often be identified for a specific protein; these are called proteotypic peptides. Closely related to proteotypicity, the detectability of a peptide is the likelihood that the molecular ions of the peptide are detected, fragmented by MS/MS and identified through a database search. Several approaches exist to predict peptide detectabilities [26–29]. We use a machine-learning approach from Schulz-Trieglaff et al. for the prediction of the detectability of a peptide and denote it with d p. As d p is a likelihood, it ranges between 0 and 1.
Similar to detectability prediction, it is also possible to predict the retention time of a peptide. Again, we use a machine-learning approach to predict the RTs of peptides. In our approach, we then assume an RT window around the predicted RT for the precursor ion selection.
Both methods for detectability and RT prediction use support vector regression (SVR) with a kernel function that works solely on the peptide sequence. For RT prediction, a training set of peptide identifications with accurate retention times is required. Model training for detectability prediction requires a positive set of observed peptides and a set of undetectable peptides.
In the following, we use the previously developed protein detectabilities for the creation of inclusion lists to maximize the sum of the protein detectabilities.
t p denotes the predicted RT for peptide p and ws is the RT window size.
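How a predicted RT and a window size translate into candidate fractions can be sketched as below. The sketch assumes that the window of size ws is centered on the predicted RT t p, which the text does not state explicitly; the function name and the list of fraction RTs are hypothetical.

```python
from typing import List


def fractions_in_window(predicted_rt: float, window_size: float,
                        fraction_rts: List[float]) -> List[int]:
    """Return the indices of all fractions whose RT lies inside the window.

    Assumption: the window spans predicted_rt - window_size / 2 up to
    predicted_rt + window_size / 2 (both ends inclusive).
    """
    half = window_size / 2.0
    return [s for s, rt in enumerate(fraction_rts)
            if abs(rt - predicted_rt) <= half]


# Six fractions spotted every 30 seconds, a peptide predicted at 70 s, ws = 60 s.
candidates = fractions_in_window(70.0, 60.0, [0.0, 30.0, 60.0, 90.0, 120.0, 150.0])
```

Here the window covers 40 to 100 seconds, so the fractions at 60 and 90 seconds (indices 2 and 3) are candidates for this peptide.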
Again, solving the LP formulation using a solver like GLPK yields a set of variables x j,s = 1 that build the inclusion list. In this case, we provide RT windows for each precursor in the inclusion list. Thus, for each precursor j there can be multiple x j,s = 1.
The methods described in the previous sections are used for inclusion list creation prior to MS/MS acquisition. In the following, we develop an LP formulation for iterative precursor ion selection where the selection is adapted during ongoing MS/MS acquisition. In contrast to replicate analyses, where new LC-MS and LC-MS/MS measurements are performed in each replication step, in iterative MS/MS acquisition the same sample and the same LC-MS map is used for the whole analysis. This is especially suited for LC-MALDI MS/MS as there the sample is “frozen” on the target and data acquisition can be suspended. After the initial LC-MS step an LC-MS feature map is created for the sample which is used for precursor ion selection. During the iterative analysis, in each iteration a set of precursors is chosen whose MS/MS acquisition is triggered. Variables corresponding to the selected precursor set are fixed for subsequent iterations. As we describe methods for LC-MALDI, we can step forward and backward “in time” by selecting fractions corresponding to different, not necessarily consecutive RTs.
The goal of the iterative precursor ion selection is twofold. On the one hand, a maximal number of proteins shall be identified with a given statistical confidence. On the other hand, a maximal possible number of precursors shall be fragmented which is limited by the available sample. For both aims, LP formulations were presented in the last sections. For the iterative precursor ion selection we combined these LPs. After each iteration, a database search is performed for each MS/MS spectrum. Afterwards, the LP formulation is adapted based on the search results. In the following, we describe the iterative workflow in detail.
Considering proteins in the LP has two main advantages. First, we can target peptides hitting protein candidates, i.e. proteins for which we received peptide identifications but which did not reach a sufficient significance to be declared identified. That way, lower intensity features that are likely to yield the missing identifications are included in the precursor set. Second, signals potentially derived from already identified proteins contribute less weight to the objective function, as these do not provide additional information.
Every time we find a new protein hit, we consider all its tryptic peptides and determine their matching LC-MS features. To this end, a feature set M p is defined for each peptide p, containing all features within a predefined m/z range around the theoretical m/z of p. Then, peptide detectabilities and peptide RTs are used to compute matching probabilities for the features in M p. The m/z values and the mass accuracy only determine which features belong to M p; they are not themselves included in the actual matching probability.
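Building the candidate set M p amounts to a tolerance filter on m/z. The following minimal sketch assumes a symmetric tolerance given in ppm relative to the theoretical m/z; the function name and the list representation of features are hypothetical, and the matching probabilities themselves are not computed here.

```python
from typing import List


def matching_features(theoretical_mz: float, feature_mzs: List[float],
                      ppm_tol: float) -> List[int]:
    """Return indices of features within ppm_tol of the theoretical m/z."""
    tol = theoretical_mz * ppm_tol * 1e-6  # absolute tolerance in m/z units
    return [j for j, mz in enumerate(feature_mzs)
            if abs(mz - theoretical_mz) <= tol]


# A peptide at m/z 1000 with 25 ppm tolerance accepts features within 0.025.
m_p = matching_features(1000.0, [999.98, 1000.02, 1000.03, 1200.0], 25.0)
```

Only the first two features fall inside the tolerance, so they would be the ones for which a matching probability is computed.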
The objective function consists of three parts: protein-based and feature-based inclusion as well as exclusion. The inclusion parts contribute a positive value weighted by factors k 1 and k 2 while the exclusion part decreases the value of the objective function for peptide signals potentially derived from already identified proteins. It is weighted by k 3. Typical values for k 1, k 2 and k 3 are 10, 1 and 10, respectively. Constraint (25) ensures that a given protein significance is reached. It considers both the peptide probabilities of already identified peptides and the “theoretical probabilities” received from the matching probabilities of tryptic peptides and observed LC-MS features. By dividing by the transformed threshold significance c and the limitation z i ≤1, additional peptide identifications of an already identified protein do not contribute to the objective function. Algorithm 1 shows the online precursor ion selection in pseudo code.
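The effect of dividing by the transformed threshold and capping at 1 can be illustrated numerically. The sketch below is not the LP itself: it assumes that peptide probabilities enter as -log(1 - p), that the threshold c is transformed in the same way, and that maximization pushes z i to the smaller of the two bounds. The function name and the pseudo count value are hypothetical.

```python
import math
from typing import List


def capped_protein_score(peptide_probs: List[float], c: float,
                         pseudo: float = 1e-9) -> float:
    """Evidence for a protein relative to the threshold c, capped at 1.

    Each probability is transformed with -log(1 - p); values of exactly 1
    are lowered by a pseudo count so that the logarithm stays finite.
    """
    evidence = sum(-math.log(1.0 - min(p, 1.0 - pseudo)) for p in peptide_probs)
    threshold = -math.log(1.0 - c)
    return min(1.0, evidence / threshold)


one_peptide = capped_protein_score([0.9], c=0.99)
many_peptides = capped_protein_score([0.99, 0.99, 0.9], c=0.99)
```

With c = 0.99, a single peptide of probability 0.9 contributes half of the required evidence, while three strong peptides yield exactly 1: once the threshold is reached, further identifications of the same protein no longer increase the value.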
For evaluation of the described methods we used two samples of different complexity. Sample 1 is a well-defined protein standard consisting of 5 pmol each of 48 human proteins (UPS1, Sigma Aldrich). Sample 2 is a tryptic digest of a cell lysate of HEK293 cells. It was prepared and provided by the group of Prof. H. Meyer (Medical Proteome Center, Ruhr University Bochum, Germany). The LC-MS/MS acquisition was done by Anja Resemann (Bruker Daltonics, Bremen, Germany). Both data sets were used in a previous study. A detailed description of the sample preparation has been published previously.
Peptide identification was done with X!Tandem (release CYCLONE (2010.12.01)) using TOPP’s XTandemAdapter. The combined target decoy version of the Swiss-Prot protein sequence database in Release 2011_08 was searched with taxonomy limited to human. Other settings included: 25 ppm mass tolerance, 0.3 Da fragment tolerance, +1 as minimal and maximal charge, methionine and tryptophan oxidation as variable modifications, 1 allowed missed cleavage and tryptic cleavage sites. Additionally, carbamidomethylation of cysteines was used as fixed modification for UPS1.
We calculated peptide posterior error probabilities (PEP) using the TOPP tool IDPosteriorErrorProbability and then converted the PEPs into peptide probabilities: p i = 1 - PEP i.
For RT and detectability model training, the TOPPtools RTModel and PTModel were used. The training data sets for UPS1 consisted of three replicate LC-MS/MS runs, filtered for peptide IDs with a probability of at least 0.99 and that are part of one of the 48 constituent proteins. For the HEK293 data set, we used the same probability threshold and filtered additionally for proteins with at least 4 peptide IDs in order to keep the training sets at a reasonable size. Figure 4 shows the deviation of predicted and observed RT for HEK293.
The three different strategies for creating a feature-based inclusion list as presented in Figure 1 - a greedy approach (GA), data dependent acquisition (DDA), and the ILP formulation (ILP) - were evaluated with a varying maximal number of precursors per RT bin, ranging from 1 to 40. This led to inclusion lists of increasing size for each of the strategies. We used RT bins of 30 and 10 seconds for UPS1 and HEK293, respectively. Additionally, DDA in conjunction with dynamic exclusion (DEX) of each scheduled precursor for the next two fractions was included in the analysis.
The inclusion list creation with the protein sequence-based ILP was evaluated on the protein standard. We compared the precursors in the inclusion list with the observed features. Whenever a predicted precursor overlapped with a feature, the peptide annotation of the feature was assigned to the precursor. This way, we were able to evaluate how many peptide and protein identifications an inclusion list would deliver. This is a strong assumption as it implies that for a given feature the fragmentation works at all RT bins in a similar quality. However, as pointed out before, it is justified by the limited reproducibility of repetitive LC-MS/MS analyses.
We further analyzed the influence of the RT window size on the number of protein IDs. We tested window sizes from 30 up to 990 seconds, using either an inclusion list with at most 1000 entries or one of unlimited size. The results for both are similar (Figure 9). With a moderate window size of 150 seconds, 34 proteins can already be identified.
We compared the performance of the iterative precursor ion selection with LPs (IPS_LP) with a heuristic approach presented in a previous study (HIPS) and with a static precursor ion selection based on an inclusion list which is sorted by intensity (SPS).
In a previous study we observed a drawback of the heuristic IPS: it showed a clear performance loss when applied to a complex sample with limited mass accuracy. This was due to erroneous assignments of theoretical peptides to observed features. In IPS_LP the specificity of assignments is increased by including RT and detectability of a peptide.
We further examined the performance of the different methods with respect to a limited number of precursors per fraction, as shown in Figure 11(b). When the maximal number of precursors in each fraction is fewer than ten, IPS_LP is able to identify more proteins than the other methods. This implies that IPS_LP is particularly applicable in conditions with limited amounts of sample.
Although with LC-MALDI MS/MS it is possible to select precursors in an order independent of their RT, in practice the sample is analyzed following a specific fraction order. Thus, in the following, we adapt the LP formulation so that it proceeds through the precursor set in a sequential order according to the fraction number.
Capacities of fractions with a lower number than s ∗ are fixed at the number of selected precursors for the respective fraction to prevent going back in RT dimension. Capacities of fractions after s ∗ are set to 0. When all precursors in s ∗ were selected or its capacity has been reached, the next fraction is set as s ∗.
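The capacity adjustment for sequential processing can be sketched as follows. The function names, the list-based bookkeeping and the `available` counts (number of selectable precursors per fraction) are hypothetical; the sketch only mirrors the three rules stated above.

```python
from typing import List


def sequential_capacities(capacity: List[int], selected: List[int],
                          current: int) -> List[int]:
    """Adjust capacities so that only the current fraction s* can be used.

    Fractions before s* are frozen at the number of precursors already
    selected there, s* keeps its capacity, later fractions are set to 0.
    """
    adjusted = []
    for s, cap in enumerate(capacity):
        if s < current:
            adjusted.append(selected[s])
        elif s == current:
            adjusted.append(cap)
        else:
            adjusted.append(0)
    return adjusted


def advance(capacity: List[int], selected: List[int],
            available: List[int], current: int) -> int:
    """Move s* forward while the current fraction is exhausted."""
    while current < len(capacity) - 1 and (
            selected[current] >= capacity[current]
            or selected[current] >= available[current]):
        current += 1
    return current


caps = sequential_capacities([3, 3, 3, 3], selected=[2, 1, 0, 0], current=1)
next_fraction = advance([3, 3, 3, 3], selected=[2, 1, 0, 0],
                        available=[2, 5, 4, 1], current=0)
```

In the example, fraction 0 contains only two selectable precursors and both have been used, so s* moves on to fraction 1, and the capacities passed to the next LP are 2, 3, 0 and 0.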
We presented methods for precursor ion selection with LC-MALDI MS/MS. We showed that inclusion list creation can be formalized as an optimization problem and efficiently solved with Linear Programs. Our methods can be used to schedule an optimal set of precursors. As examples, we presented two situations in which the information available prior to MS/MS acquisition differs. When the protein sequences of interest are known, our method for inclusion list creation using protein sequences is well suited, as it creates very efficient inclusion lists. Various adaptations of our methods are possible and can be easily integrated. For instance, the protein sequence-based inclusion list can be adapted to consider not all tryptic peptides of a protein but a specific predefined set of peptides that can be used for quantification of proteins in different cell states or in time series.
Finally, we presented a new method for iterative precursor ion selection that identifies proteins more efficiently than data dependent methods. This efficiency improvement is twofold: peptides from already identified proteins contribute less weight to the objective function and thus are less likely to be selected as precursors. This way the redundancy of information obtained with MS/MS can be reduced. On the other hand, IPS_LP requires considerably fewer MS/MS acquisitions to identify the same number of proteins as a static inclusion list. The remaining sample and analysis time can be used for identifying more proteins in a sample. Compared to our previously published heuristic method, IPS_LP does not suffer from massive false exclusion of signals in complex samples by incorporating machine learning methods for RT and proteotypicity predictions.
We thank Anja Resemann and Detlef Suckau from Bruker Daltonics (Bremen, Germany) for providing the HEK293 data set. We thank Samira Jaeger and Anja Kasseckert for proofreading the manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.