A coherent mathematical characterization of isotope trace extraction, isotopic envelope extraction, and LC-MS correspondence
© Smith et al.; licensee BioMed Central Ltd. 2015
Published: 23 April 2015
Liquid chromatography-mass spectrometry is a popular technique for high-throughput protein, lipid, and metabolite comparative analysis. Such statistical comparison of millions of data points requires the generation of an inter-run correspondence. Though many techniques for generating this correspondence exist, few, if any, address certain well-known run-to-run LC-MS behaviors such as elution order swaps, unbounded retention time swaps, missing data, and significant differences in abundance. Moreover, not all extant correspondence methods leverage the rich discriminating information offered by isotope envelope extraction informed by isotope trace extraction. To date, no attempt has been made to create a formal generalization of extant algorithms for these problems.
By enumerating extant objective functions for these problems, we elucidate discrepancies between known LC-MS data behavior and extant approaches. We propose novel objective functions that more closely model known LC-MS behavior.
Through instantiating the proposed objective functions in the form of novel algorithms, practitioners can more accurately capture the known behavior of isotope traces, isotopic envelopes, and replicate LC-MS data, ultimately providing for improved quantitative accuracy.
Liquid chromatography-mass spectrometry (LC-MS) is a popular technique for elucidating the composition of liquid samples. Data processing considerations are essential to accurately determine the identity of molecules (analytes such as lipids or peptides) contained in the sample (a process called identification), as well as their quantity in the sample (a process called quantification).
Information about sample quantity is captured directly in survey scans, or MS (aka MS1) data. Fragmentation spectra of one or more analytes constitute MS/MS (or MS2) data, and this information is typically used to corroborate or ascertain the identity of a molecule. Partitioning/clustering MS1 signal from complex samples and mapping the signal to other analyses (correspondence) is challenging. Some quantification strategies bypass these challenges by using information derived directly or indirectly from MS/MS data. These methods include spectral counting and isobaric tags for relative and absolute quantification (iTRAQ). Though these methods have been successful, the amount of quantifiable signal embedded in MS1 data is estimated to far exceed what is currently available by MS/MS; however, most MS1 data remains unused by current software. Hence, improving methods for partitioning and mapping MS1 signal stands to significantly (~10-fold) increase the sensitivity of a typical label-free or isotope-labeling MS-omics experiment, both for experiments currently being run and for past experiments where raw data is still available.
Mass spectrometry data, in its raw form, is not ideal for isotope trace extraction or subsequent processing. After internally accumulating signal over discrete time slices, the mass spectrometer outputs raw data condensed into the form of many narrow profiles wherever signal is present. Conversion to centroid mode integrates the abundance of each of these profiles into a single tuple called a centroid. This is considered a routine conversion for which ample software is readily available. We adopt the typical convention of using centroid data.
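The profile-to-centroid conversion described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular vendor or converter: the names `Centroid` and `centroid_profile` are hypothetical, abundance is taken as a plain sum of the profile intensities, and m/z as their intensity-weighted mean.

```python
from typing import NamedTuple, Sequence


class Centroid(NamedTuple):
    mz: float         # µ, mass-to-charge ratio
    rt: float         # τ, retention time
    abundance: float  # α, integrated intensity


def centroid_profile(mzs: Sequence[float],
                     intensities: Sequence[float],
                     rt: float) -> Centroid:
    """Collapse one narrow profile peak into a single centroid.

    Assumption: abundance is the sum of the profile intensities and
    m/z is the intensity-weighted mean of the profile m/z values.
    """
    total = sum(intensities)
    if total <= 0:
        raise ValueError("profile has no signal")
    mz = sum(m * i for m, i in zip(mzs, intensities)) / total
    return Centroid(mz=mz, rt=rt, abundance=total)


# One profile peak sampled at three m/z points within a single scan.
c = centroid_profile([500.00, 500.01, 500.02], [10.0, 80.0, 10.0], rt=12.5)
```

Each scan then contributes one such tuple per profile, and the collection of tuples over all scans forms the centroid set used in the remainder of this section.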
Despite the ubiquity of LC-MS experiments, to the best of our knowledge, no concise, complete description of the LC-MS isotope trace and isotopic envelope extraction problems exists. Here, we describe constructs for isotope traces and isotopic envelopes, as well as formally describe the relationship of centroids, isotope traces and isotopic envelopes. In this context, we review extant objective functions for isotope trace extraction, isotopic envelope extraction, and correspondence. Finally, we propose novel objective functions for each of these tasks that address shortcomings in current approaches.
Results and discussion
Isotope trace extraction
The most important data processing step in a typical quantitative LC-MS pipeline is isotope trace extraction. Clustering centroids into isotope traces is a non-trivial problem due to the many sources of noise affecting centroid mass and abundance. Sources of noise affecting centroids include chemistry effects due to chromatography, abundance inaccuracy due to ionization efficiencies, m/z deviation due to machine calibration, occlusion/adulteration of low-abundance signal due to dynamic range limitations, and compounded inaccuracies in mass-to-charge ratio (m/z) and abundance due to centroid construction. Of course, these complications are propagated from the clustering of isotope traces to the clustering of isotopic envelopes to the identification of cross-experiment correspondence.
A centroid is denoted as c = (µ, τ, α) where µ, τ, α are values for m/z, retention time (RT), and abundance, respectively. A single MS run produces a set of centroids , where n can readily reach into the millions.
where $c_\alpha$ is the abundance of centroid $c$ and $c_\mu$ is the m/z of centroid $c$.
Note that the behavior of isotope traces is dependent on all three MS dimensions, although many common approaches to isotope trace extraction ignore one or more of these dimensions. For example, most proprietary MS software uses hard m/z bins for isotope trace extraction.
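To make the limitation of hard m/z bins concrete, the sketch below groups centroids by a fixed bin width and ignores retention time and abundance entirely. The function name `bin_centroids` and the 0.01 bin width are illustrative assumptions, not settings taken from any specific software.

```python
import math
from collections import defaultdict


def bin_centroids(centroids, bin_width=0.01):
    """Group (mz, rt, abundance) tuples into fixed-width m/z bins.

    Only the m/z dimension is used; retention time and abundance
    play no role in the grouping.
    """
    bins = defaultdict(list)
    for mz, rt, abundance in centroids:
        bins[math.floor(mz / bin_width)].append((mz, rt, abundance))
    return dict(bins)


centroids = [
    (500.001, 10.0, 50.0),
    (500.009, 10.5, 80.0),  # 16 ppm from the first centroid, same bin
    (500.011, 11.0, 60.0),  # 4 ppm from the second centroid, different bin
]
bins = bin_centroids(centroids)
```

In this toy input the two centroids that are closest in m/z are separated by a bin edge, while two centroids that are four times farther apart are grouped together.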
Extant objective functions
with isotope trace-specific scaling parameter $b_F$ and translation parameter $t_F$ chosen to maximize the convolutional fit over isotope trace $F$.
for some intensity threshold $\theta$ and centroid distance function $\delta_c$, resulting in $G$ being composed of one or more connected components, each considered one isotope trace. Thus, , where the neighborhood function $\Upsilon(c)$ returns the set of nodes connected to $c$ (and is symmetric because $G$ is undirected).
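The graph formulation above can be sketched as a connected-components search. The sketch assumes that an edge joins two centroids whose distance is at most a cutoff `max_dist`; the cutoff, the function names, and the particular distance (a scaled ppm difference combined with a scaled RT difference) are hypothetical choices made for illustration.

```python
def scaled_distance(a, b, ppm_scale=10.0, rt_scale=1.0):
    """Hypothetical centroid distance: the larger of the scaled ppm and RT gaps."""
    ppm = abs(a[0] - b[0]) / a[0] * 1e6
    return max(ppm / ppm_scale, abs(a[1] - b[1]) / rt_scale)


def extract_traces(centroids, theta, delta, max_dist):
    """Return connected components of the thresholded centroid graph.

    centroids: iterable of (mz, rt, abundance) tuples.
    Nodes are centroids with abundance >= theta; an (assumed) edge joins
    two nodes whenever delta(a, b) <= max_dist.
    """
    nodes = [c for c in centroids if c[2] >= theta]
    neighbors = {i: set() for i in range(len(nodes))}
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if delta(nodes[i], nodes[j]) <= max_dist:
                neighbors[i].add(j)
                neighbors[j].add(i)  # undirected graph, symmetric neighborhood
    seen, traces = set(), []
    for start in range(len(nodes)):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            i = stack.pop()
            component.append(nodes[i])
            for j in neighbors[i]:
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        traces.append(sorted(component, key=lambda c: c[1]))
    return traces


centroids = [
    (500.000, 1.0, 100.0), (500.001, 2.0, 120.0), (500.002, 3.0, 90.0),
    (600.000, 2.0, 200.0),
    (500.001, 2.5, 1.0),  # below the intensity threshold, dropped
]
traces = extract_traces(centroids, theta=10.0, delta=scaled_distance, max_dist=1.5)
```

The pairwise loop is quadratic in the number of centroids, so a practical implementation would need a spatial index; the sketch only illustrates how the threshold and distance function jointly determine the traces.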
The objective functions for massifquant and MaxQuant define as the set of all $F$ formed by iterating over values of time $t$, and adding $c$ if $c_\tau = t$ and , where $c^* \in F$ and for all $c_j \in F$. For massifquant, $\epsilon$ is prescribed by a Kalman filter induced from the variance in $c_\mu$ and $c_\alpha$ for all $c_j \in F$ such that , with the added constraint that $c_\tau$ be unique in $F$. MaxQuant defines $\epsilon$ simply as a distance threshold of 7 ppm m/z.
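A scan-by-scan construction with a fixed m/z tolerance can be sketched as below. This is a simplification under stated assumptions, not the MaxQuant or massifquant implementation: each open trace greedily takes the closest unused peak in the next scan, the comparison is made against the most recently added centroid, and a trace is closed as soon as one scan has no peak within tolerance. The function names are hypothetical; only the 7 ppm value comes from the text.

```python
def ppm_diff(mz_a, mz_b):
    """Absolute m/z difference in parts per million, relative to mz_b."""
    return abs(mz_a - mz_b) / mz_b * 1e6


def build_traces(scans, tol_ppm=7.0):
    """Grow isotope traces scan by scan.

    scans: list of (rt, [(mz, abundance), ...]) in increasing rt order.
    Returns a list of traces, each a list of (mz, rt, abundance) tuples
    with at most one centroid per scan.
    """
    finished, open_traces = [], []
    for rt, peaks in scans:
        still_open, used = [], set()
        for trace in open_traces:
            last_mz = trace[-1][0]
            best = None  # (ppm distance, peak index)
            for k, (mz, _) in enumerate(peaks):
                d = ppm_diff(mz, last_mz)
                if k not in used and d <= tol_ppm and (best is None or d < best[0]):
                    best = (d, k)
            if best is None:
                finished.append(trace)  # no peak within tolerance: close trace
            else:
                k = best[1]
                used.add(k)
                trace.append((peaks[k][0], rt, peaks[k][1]))
                still_open.append(trace)
        # Any peak not claimed by an open trace starts a new trace.
        for k, (mz, abundance) in enumerate(peaks):
            if k not in used:
                still_open.append([(mz, rt, abundance)])
        open_traces = still_open
    return finished + open_traces


scans = [
    (1.0, [(500.000, 10.0), (600.000, 5.0)]),
    (2.0, [(500.002, 20.0)]),
    (3.0, [(500.001, 15.0), (500.050, 3.0)]),
]
traces = build_traces(scans)
```

Replacing the fixed tolerance with a per-trace tolerance updated from the centroids already in the trace would move the sketch toward the adaptive behavior the text attributes to the Kalman filter approach.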
Proposed objective functions
where, again, centroid clustering and retention time means $F_t$ are chosen to minimize the Gaussian fit error; however, rather than using a single global variance in the RT dimension, each isotope trace $F$ has a local variance $\sigma_F$; in addition, the scaling factors have become time-dependent scalar functions $b_F(\cdot)$. The second Gaussian factor, parameterized by mean $F_\mu$ and variance function $h(\cdot)$, models the m/z width of the isotope trace, which is a function of the abundance $\alpha$. Isotope traces splay at low abundance and narrow at high abundance; thus, both the variance $h(\cdot)$ and the scaling factors $a_F(\cdot)$ are modeled as functions dependent on the abundance $\alpha$. Note that while variance is trace-independent (depending only on abundance), each isotope trace has its own scaling function (which in turn is dependent on abundance).
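The role of an abundance-dependent m/z variance can be illustrated with a small sketch. The text does not specify the form of h(·); the inverse-proportional form used here, together with the constants `base_var` and `ref_abundance`, is purely a hypothetical placeholder chosen so that variance shrinks as abundance grows.

```python
def h(abundance, base_var=1e-4, ref_abundance=1e4):
    """Hypothetical variance function: m/z variance shrinks as abundance grows."""
    return base_var * ref_abundance / max(abundance, 1.0)


def mz_penalty(mz, abundance, trace_mz):
    """Exponent of a Gaussian in m/z whose variance depends on abundance.

    The same m/z deviation from the trace mean costs more for an
    abundant centroid than for a faint one.
    """
    return (mz - trace_mz) ** 2 / (2.0 * h(abundance))


deviation_mz = 500.005  # 0.005 m/z away from a trace centered at 500.000
high = mz_penalty(deviation_mz, abundance=1e6, trace_mz=500.0)
low = mz_penalty(deviation_mz, abundance=1e2, trace_mz=500.0)
```

Under this placeholder, a low-abundance centroid is allowed to stray from the trace m/z at little cost, which mirrors the splaying behavior described above, while an abundant centroid with the same deviation is penalized heavily.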
Alleviating current limitations in isotopic trace extraction
Current objective functions for isotopic trace extraction fail to capture isotopic trace behavior formalized in this section: namely, a pattern of centroids forming a generally tight distribution through time around a specific m/z, with variation occurring as a factor of abundance, with normal abundance traces splaying at the beginning and end of elution, and lower abundance traces displaying high m/z variance in general. Moreover, isotope traces are skewed in time, with sharp onset of intensity followed by a post-peak long tail. The shape of traces is almost never strictly Gaussian (or even symmetric), as chromatography almost always deviates from the Gaussian in heading (which is more steep) and in tailing (which is less steep). Our objective functions account for each of these behaviors.
Isotopic envelope extraction
In other words, 1) all centroids are assigned to an isotope trace; 2) isotope traces can't share centroids. Because any sensor's detection of a physical system will deviate somewhat from the true physical system, we can expect MS detections to contain extraneous centroids. However, all signal ought to be accounted for (even if some identified "traces" eventually are identified as noise) and, in a platonic model, ought to be assigned to an isotope trace.
The choice of partitions $\varphi$ and $\psi$ is guided by a set of distance functions $\Delta$ that define distances between centroids, isotope traces, isotopic envelopes, etc. and objective functions $\lambda_F$ and $\lambda_E$ that describe "good" isotope traces and isotopic envelopes, respectively. The choice of distance and objective functions, along with choice of optimization procedure, characterizes an algorithmic approach for solving this clustering problem. A defining general property of isotopic envelopes, however, is the regular spacing between component isotope traces. In addition, for virtually all molecules from biological sources we expect that if there is an isotope with index $j$ and an isotope with index $j + 2$, then there exists an isotope with index $j + 1$.
where is the uncharged molecular weight of the ion.
Every isotope trace consists of signal from at least one isotopic envelope, and, in the case of overlapping isotopic envelopes, an isotope trace may be composed of signal from more than one isotopic envelope.
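The regular spacing and contiguity properties of isotopic envelopes can be checked with a short sketch. It assumes the spacing between adjacent isotope traces is the 13C/12C mass difference (about 1.00335 Da) divided by the charge, which is the usual approximation for carbon-rich biological molecules; the function names and the m/z tolerance are hypothetical.

```python
C13_C12_DELTA = 1.0033548  # Da, mass difference between 13C and 12C


def isotope_indices(trace_mzs, z, tol=0.005):
    """Assign an isotope index to each trace m/z for a candidate charge z.

    Indices are relative to the lightest trace. Returns None if any
    trace does not fall on the expected m/z grid within tol.
    """
    base = min(trace_mzs)
    spacing = C13_C12_DELTA / z
    indices = []
    for mz in sorted(trace_mzs):
        k = round((mz - base) / spacing)
        if abs(mz - (base + k * spacing)) > tol:
            return None
        indices.append(k)
    return indices


def is_contiguous(indices):
    """True if the indices are exactly 0, 1, ..., n-1 (no skipped isotope)."""
    return indices is not None and indices == list(range(len(indices)))


full = isotope_indices([500.0, 500.5017, 501.0034], z=2)
gapped = isotope_indices([500.0, 501.0034], z=2)
```

For the second call the indices come out as 0 and 2 at charge 2, so the contiguity check rejects that grouping at charge 2, although the same pair would be contiguous at charge 1.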
Extant objective functions
where the $G_E$ compute a comparison between the $(\mu, \tau, \alpha)$ values for a centroid and the expected centroid values obtained from a heuristic isotopic envelope shape. Note that isotopic trace extraction is ignored.
where the notation $c \in_\tau F$ means that $c \in F$ at time $\tau$, $E$ is the maximal intensity (instantaneous) isotopic envelope (at time $\tau$), is the ratio of the intensity of isotope trace $F$ (at time $\tau$) to the total intensity of all isotope traces $F \in E$ (at time $\tau$), and $P_m(\cdot)$ is the value of the Poisson distribution at $c_\mu$.
Proposed objective functions
where $F_\tau$ could be defined analogously to Equation 7, could be the maximum intensity for isotopic trace $F$ or could be some other reasonable definition for isotopic trace elution time.
We want to optimize ε and the $z_E$ so that $\lambda_E$ is minimized; that is, we want to find charge-state/isotopic-envelope pairs such that the errors in expected m/z and co-elution time are minimized.
The isotopic envelope extraction segment of the MaxQuant algorithm is one of the possible instantiations of this objective function, though many possibilities exist for how to set the allowable m/z and RT error and how to generate the prerequisite list of isotope traces.
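One way to read the minimization above is as a search over candidate charge states for a fixed group of isotope traces. The sketch below is an illustration under stated assumptions rather than the proposed objective function itself: the error is a weighted sum of squared deviations from the expected m/z gap and from the mean apex retention time, adjacent traces are assumed to be consecutive isotopes, and the weights and function names are hypothetical.

```python
def envelope_error(traces, z, w_mz=1.0, w_rt=1.0):
    """Error of a candidate envelope at charge z.

    traces: list of (mz, apex_rt) tuples sorted by m/z, assumed to be
    consecutive isotopes. The m/z term penalizes gaps that differ from
    1.0033548 / z; the RT term penalizes traces that do not co-elute.
    """
    spacing = 1.0033548 / z
    mz_err = sum((b[0] - a[0] - spacing) ** 2 for a, b in zip(traces, traces[1:]))
    mean_rt = sum(t[1] for t in traces) / len(traces)
    rt_err = sum((t[1] - mean_rt) ** 2 for t in traces)
    return w_mz * mz_err + w_rt * rt_err


def best_charge(traces, max_z=6):
    """Charge state in 1..max_z with the smallest envelope error."""
    return min(range(1, max_z + 1), key=lambda z: envelope_error(traces, z))


traces = [(500.0000, 30.0), (500.3345, 30.1), (500.6689, 29.9)]
z = best_charge(traces)
```

For a fixed set of traces the retention time term is the same for every charge, so it only matters when competing groupings of traces are compared; the relative weighting of the two terms is a modeling choice left open here.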
Alleviating current limitations in isotopic envelope extraction
Isotopic envelopes are rich with data: the expectation of contiguous isotope traces with a uniform m/z charge gap, and similar maximal abundance across all isotope traces. Accounting for this behavior is not possible without adopting an isotope trace-centric approach to data extraction. Reliance upon maximal elution time alone--an approach that is susceptible to conflation with overlapping envelopes in complex samples--is not a sensitive approach in envelopes of lower abundance, where maximal elution times are not pronounced. Moreover, by first finding the isotope traces, the exact m/z of each isotope trace can be calculated using a weighted average, alleviating the need for larger than theoretically justified isotope trace gaps, which will not be sensitive in complex samples with overlapping isotopic envelopes. Instead, the proposed objective functions leverage a precise and reliable m/z charge gap and adjacency of isotope traces along with maximal elution times, using all the information in the data.
The combination of noise from within one run (enumerated above) and noise from run to run--most notable in retention time shifts, where an isotopic envelope appears at a different retention time or with a compressed or stretched RT length compared to another run--makes LC-MS correspondence non-trivial.
The correspondence mapping should again optimize an objective function which, in turn, characterizes an algorithm choice for solving the correspondence problem.
Extant objective functions
where $\delta_{\tau,\mu}(\cdot)$ is a distance function defined over RT and m/z.
where D is the set of observed runs.
Proposed objective functions
In contrast to existing LC-MS correspondence objective functions, the objective functions suggested here use the entire isotopic envelope. This allows greater discrimination by using isotope trace quantity and spacing to match isotopic envelopes from different runs. This extra discrimination is essential given the amount of RT variance and (to a lesser degree) m/z variance present in the data.
Let R be a set of runs, each of which has an associated set of isotopic envelopes and let . We seek to find a binary equivalence relation ρ that induces a set of correspondence classes over that is reflexive (an envelope corresponds with itself), symmetric (if envelope E1 from run 1 corresponds with envelope E2 from run 2, then E2 also corresponds with E1) and transitive (if envelope E1 from run 1 corresponds with envelope E2 from run 2 and envelope E2 corresponds with envelope E3 from run 3, then E1 corresponds with E3); and if , then for k ≠ i, and for k ≠ j, (an envelope from one run may have 0 or 1 matches from any other run; note that due to reflexivity, this also means that two non-identical envelopes from the same run never correspond).
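The structural requirements on the correspondence relation can be checked mechanically. In the sketch below, envelopes are represented as hypothetical (run id, envelope id) pairs, pairwise matches are closed under reflexivity, symmetry, and transitivity with a union-find structure, and each resulting class is then tested for the constraint that it contains at most one envelope per run.

```python
def correspondence_classes(envelopes, matches):
    """Equivalence classes induced by pairwise matches (union-find closure)."""
    parent = {e: e for e in envelopes}

    def find(e):
        while parent[e] != e:
            parent[e] = parent[parent[e]]  # path compression
            e = parent[e]
        return e

    for a, b in matches:
        parent[find(a)] = find(b)
    classes = {}
    for e in envelopes:
        classes.setdefault(find(e), set()).add(e)
    return list(classes.values())


def is_valid(classes):
    """True if no class holds two envelopes from the same run."""
    for cls in classes:
        runs = [run for run, _ in cls]
        if len(runs) != len(set(runs)):
            return False
    return True


envelopes = [("run1", "a"), ("run2", "a"), ("run3", "a"), ("run1", "b")]
good = correspondence_classes(
    envelopes,
    [(("run1", "a"), ("run2", "a")), (("run2", "a"), ("run3", "a"))],
)
bad = correspondence_classes(
    envelopes,
    [(("run1", "a"), ("run2", "a")), (("run2", "a"), ("run3", "a")),
     (("run1", "b"), ("run3", "a"))],
)
```

The second set of matches looks harmless pair by pair, but transitivity places two envelopes from run 1 in the same class, which is exactly the situation the constraint rules out.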
This relation should minimize
The difference in charge state between corresponding isotopic envelopes, .
The difference in m/z between isotope traces in corresponding isotopic envelopes, .
The difference in elution duration between isotope traces in corresponding isotopic envelopes, .
The difference in isotope abundance ratios between corresponding isotopic envelopes, .
The difference in m/z between corresponding isotopic envelopes, .
The number of singleton correspondence classes, .
The difference in retention time between corresponding isotopic envelopes, .
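The pairwise terms in this list can be combined into a single dissimilarity score, as sketched below. The weighted-sum form, the weights, the `Envelope` fields, and the use of the lightest trace as the envelope m/z are all assumptions made for illustration; the number of singleton classes is a property of the whole relation rather than of a pair, so it is left out of this pairwise sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Envelope:
    charge: int
    trace_mzs: List[float]         # m/z of each isotope trace, lightest first
    trace_durations: List[float]   # elution duration of each isotope trace
    trace_abundances: List[float]  # abundance of each isotope trace
    rt: float                      # envelope retention time


def _ratios(abundances):
    total = sum(abundances)
    return [a / total for a in abundances]


def _mean_abs_diff(xs, ys):
    pairs = list(zip(xs, ys))  # compares traces index by index, shortest list wins
    return sum(abs(x - y) for x, y in pairs) / len(pairs)


def pair_cost(e1, e2, weights=None):
    """Weighted sum of the pairwise differences between two envelopes."""
    w = weights or {"charge": 10.0, "trace_mz": 100.0, "duration": 0.1,
                    "ratio": 5.0, "envelope_mz": 100.0, "rt": 0.05}
    terms = {
        "charge": abs(e1.charge - e2.charge),
        "trace_mz": _mean_abs_diff(e1.trace_mzs, e2.trace_mzs),
        "duration": _mean_abs_diff(e1.trace_durations, e2.trace_durations),
        "ratio": _mean_abs_diff(_ratios(e1.trace_abundances),
                                _ratios(e2.trace_abundances)),
        "envelope_mz": abs(e1.trace_mzs[0] - e2.trace_mzs[0]),
        "rt": abs(e1.rt - e2.rt),
    }
    return sum(w[name] * value for name, value in terms.items())


e1 = Envelope(2, [500.000, 500.5017], [20.0, 18.0], [100.0, 60.0], 300.0)
e2 = Envelope(2, [500.001, 500.5027], [21.0, 18.0], [90.0, 55.0], 305.0)
e3 = Envelope(3, [500.000, 500.3345], [20.0, 18.0], [100.0, 60.0], 300.0)
near, far = pair_cost(e1, e2), pair_cost(e1, e3)
```

Here the envelope with a shifted retention time but the same charge and isotope spacing scores as much closer than the envelope with identical retention time but a different charge, which is the kind of discrimination a single-point representation cannot provide.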
Alleviating current limitations in correspondence
Recently, several ubiquitous shortcomings were identified in a review of over 50 LC-MS correspondence algorithms. The most significant of these shortcomings was the fact that all current LC-MS correspondence algorithms make model assumptions that fail to capture common behavior. In other words, each algorithm is constructed in such a way that the algorithm is guaranteed to get the wrong answer under certain conditions that are common to real LC-MS data. The behaviors discussed included the ideas that:
Not all analytes appear in all replicates.
Elution order can swap.
Shifts occur in m/z as well as in RT.
Some correspondence methods reduce isotopic envelopes to a single point representation. This deprives the method of a rich source of distinguishing data found in full isotopic envelopes--the expectation of contiguous isotope traces with a uniform m/z charge gap, number of isotope traces, and relative abundance ratio of isotope traces. Similarly, most correspondence algorithms conduct an initial RT alignment, where signals (almost always much-reduced from the full isotopic envelope, and rarely built up from isotope traces to isotopic envelopes) are shifted up or down in RT (preserving original order) in order to most closely match a reference run. This is invariably followed by direct matching. The problem is that the initial warping is a lossy procedure that adulterates the original RT, which would be useful for probabilistically ascertaining the closest corresponding isotopic envelope.
The proposed objective function does not force matches between runs, as it is very common for species to either not be present or fall below the signal-to-noise ratio in differential studies. Instead, the proposed objective function leverages the full breadth of isotope envelope information, allowing a rigorous direct comparison of candidate correspondences based on all available data to select the most likely correspondence (in the sense of minimizing error), or no correspondence at all if that is the most likely case given the data.
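The option of returning no correspondence at all can be sketched as a comparison against an explicit cost for leaving an envelope unmatched. The function `best_match`, the parameter `no_match_cost`, and the toy cost used in the example are hypothetical; any envelope-level dissimilarity could be passed in as `cost`.

```python
def best_match(query, candidates, cost, no_match_cost):
    """Return the lowest-cost candidate, or None if no match is cheaper.

    A candidate is accepted only if its cost is strictly below
    no_match_cost, so a match is never forced.
    """
    best, best_cost = None, no_match_cost
    for candidate in candidates:
        c = cost(query, candidate)
        if c < best_cost:
            best, best_cost = candidate, c
    return best


def toy_cost(a, b):
    """Hypothetical cost on (mz, rt) pairs: ppm difference plus scaled RT gap."""
    return abs(a[0] - b[0]) / a[0] * 1e6 + 0.1 * abs(a[1] - b[1])


query = (500.000, 300.0)
matched = best_match(query, [(500.001, 310.0), (500.200, 301.0)],
                     toy_cost, no_match_cost=20.0)
unmatched = best_match(query, [(500.200, 301.0)], toy_cost, no_match_cost=20.0)
```

In the second call the only candidate is close in retention time but far in m/z, so the query is left unmatched rather than paired with the nearest available signal.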
We present a concise attempt to formalize LC-MS data clustering problems, describing the constructs of isotope traces and isotopic envelopes and their relational structure. We provide a review of current approaches to isotope trace extraction and LC-MS correspondence, and propose novel objective functions for both tasks that address shortcomings in current methods.
- Choi H, Fermin D, Nesvizhskii AI: Significance analysis of spectral count data in label-free shotgun proteomics. Mol Cell Proteomics. 2008, 7 (12): 2373-2385. 10.1074/mcp.M800203-MCP200.
- Wiese S, Reidegeld KA, Meyer HE, Warscheid B: Protein labeling by iTRAQ: a new tool for quantitative mass spectrometry in proteome research. Proteomics. 2007, 7 (3): 340-350. 10.1002/pmic.200600422.
- Michalski A, Cox J, Mann M: More than 100,000 Detectable Peptide Species Elute in Single Shotgun Proteomics Runs but the Majority is Inaccessible to Data-Dependent LC-MS/MS. Journal of Proteome Research. 2011, 10: 1785-1793. 10.1021/pr101060v.
- Cappadona S, Baker PR, Cutillas PR, Heck AJ, van Breukelen B: Current challenges in software solutions for mass spectrometry-based quantitative proteomics. Amino Acids. 2012, 43 (3): 1087-1108. 10.1007/s00726-012-1289-8.
- Tautenhahn R, Bottcher C, Neumann S: Highly sensitive feature detection for high resolution LC/MS. BMC Bioinformatics. 2008, 9 (1): 504. 10.1186/1471-2105-9-504.
- Pluskal T, Castillo S, Villar-Briones A, Oresic M: MZmine 2: Modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinformatics. 2010, 11 (1): 395. 10.1186/1471-2105-11-395.
- Conley CJ, Smith R, Torgrip RJ, Taylor RM, Tautenhahn R, Prince JT: Massifquant: open-source Kalman filter based XC-MS isotope trace feature detection. Bioinformatics. 2014, 359.
- Cox J, Mann M: MaxQuant enables high peptide identification rates, individualized ppb-range mass accuracies and proteome-wide protein quantification. Nature Biotechnology. 2008, 26 (12): 1367-1372. 10.1038/nbt.1511.
- Weisser H, Nahnsen S, Grossmann J, Nilse L, Quandt A, Brauer H, Sturm M, Kenar E, Kohlbacher O, Aebersold R, et al: An automated pipeline for high-throughput label-free quantitative proteomics. Journal of Proteome Research. 2013.
- Bellew M, Coram M, Fitzgibbon M, Igra M, Randolph T, Wang P, May D, Eng J, Fang R, Lin C, et al: A suite of algorithms for the comprehensive analysis of complex protein mixtures using high-resolution LC-MS. Bioinformatics. 2006, 22 (15): 1902-1909. 10.1093/bioinformatics/btl276.
- Smith R, Ventura D, Prince JT: LC-MS Alignment in Theory and Practice: A Comprehensive Algorithmic Review. Briefings in Bioinformatics. 2013.
- Listgarten J, Neal RM, Roweis ST, Wong P, Emili A: Difference detection in LC-MS data for protein biomarker discovery. Bioinformatics. 2007, 23 (2): 198-204. 10.1093/bioinformatics/btl553.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.