For decades, mass spectrometry data has been analyzed to investigate a wide array of research interests, including disease diagnostics, biological and chemical theory, genomics, and drug development. Progress towards solving any of these disparate problems depends upon overcoming the common challenge of interpreting the large data sets generated. Despite interim successes, many data interpretation problems in mass spectrometry are still challenging. Further, though these challenges are inherently interdisciplinary in nature, the significant domain-specific knowledge gap between disciplines makes interdisciplinary contributions difficult.
This paper provides an introduction to the burgeoning field of computational mass spectrometry. We illustrate key concepts, vocabulary, and open problems in MS-omics, as well as provide invaluable resources such as open data sets and key search terms and references.
This paper will facilitate contributions from mathematicians, computer scientists, and statisticians to MS-omics that will fundamentally improve results over existing approaches and inform novel algorithmic solutions to open problems.
Robust data processing tools for MS data are lagging behind the substantial advances occurring in instrumentation and protocol . One reason for this is that few outside experts--mathematicians, computer scientists, and statisticians--have climbed the learning curve (usually requiring several years of dedicated study) to understand the terminology, chemical theory, workflows, and challenges of MS-omics (proteomics, lipidomics, and metabolomics). This sort of interdisciplinary learning curve is not unusual in bioinformatics; however, the influx of external experts to genomics has not been seen to date in MS-omics. One reason for this is the lack of a succinct and cogent introductory resource that can bring outside experts to a basic but functional level of MS-omics familiarity.
In this primer, we will elucidate the mechanisms of MS-omics, the problems it is used to solve, key concepts and terms found in the literature, and open problems and their salient literature. The purpose of this tutorial is to expedite the new researcher's acquisition of a functional knowledge of MS-omics sufficient for contribution to the field.
Results and discussion
Relationship of genomics, proteomics, lipidomics, and metabolomics
The exponential growth of genomics studies during the last ten years has not been matched by corresponding research in MS-omics . Genomics researchers have several peer-reviewed conferences in which to publish their results. To the best of our knowledge, there has not been a single peer-reviewed conference to date on lipidomics or metabolomics, let alone any specifically addressing algorithmic approaches to problems specific to either area, although there are periodic special genomics conferences dedicated to proteomics. Several existing venues labeled as bioinformatics will not accept papers on MS-omics, as their stated area of interest is limited to a distinct subfield of bioinformatics such as genomics. This phenomenon of focus on genomics is also reflected in institutional research programs. In a recent review of 78 post-secondary degree-granting bioinformatics programs, 22 programs noted a research emphasis in genomics, while 18 noted a research emphasis in proteomics. Not a single institution listed a research program in lipidomics or metabolomics .
The biological reach and impact of research in MS-omics is so extensive that it can be argued that MS-omics should now be the highest priority of systems biology . From a pragmatic perspective, the large set of fresh problems and substantial potential for impact in MS-omics ought to be very attractive to those in more crowded disciplines.
Proteomics is the study of biological processes via the analysis of protein expression or state in cells or tissue. Proteins are ubiquitous building blocks of life, and they are composed of peptides, which are chains of amino acids built by translating mRNA. There are 20 amino acids, uniquely abbreviated with a single letter. Peptides thus can be described as a string of the letters corresponding to the amino acids. Though protein sequences are determined by DNA sequences, post translational protein modifications (such as acetates, phosphates, lipids etc.) are not as easily predicted. These modifications quickly diversify and regulate/complicate protein function and cellular protein composition and are characteristic in most cellular processes and diseases. Therefore, the aim of MS-proteomics is to provide data that DNA sequences cannot--namely, individual protein concentrations and identification of post-translational modifications.
Lipidomics is the systems-level analysis of lipids (fat molecules) and their interactions . It is a science still in its infancy but one that promises to revolutionize biochemistry . Lipids are grouped into eight categories that share common physical and chemical properties [4, 6], and there are currently some 38,000 documented lipids.
Lipids that occur rarely or in small quantities are often the most effectual lipids in biological processes, meaning they are particularly important in disease diagnostics and in understanding pathology . Lipidomics can elucidate the pathology and treatment of many diseases such as cancer, diabetes, obesity, cardiovascular disease, arthritis, asthma, inflammatory bowel disease, Alzheimer's and others due to the associated disruption of lipid metabolic enzymes and pathways [7, 2, 8, 5]. A better understanding of lipidomics could significantly advance diagnostic medicine as well as provide novel treatment options.
Metabolomics is the study of metabolomes--small molecular end products of cellular regulatory pathways  that can provide a snapshot of cell physiology. Metabolites are much smaller than proteins and smaller than most lipids. Their small size precludes the direct overlap of some techniques used in proteomics or lipidomics, but they may be generally analyzed in similar ways. Lipids may be classified as a subset of metabolites; however, mass spectrometrists typically consider lipids distinct from metabolites because analytically they must be treated separately (i.e., require different solvents).
The workflow from sample preparation to result quantification, can be split into two consecutive pipelines: the wet-lab pipeline and the data processing pipeline. The data processing pipeline consists of many possible processing steps that take the data resulting from the wet-lab pipeline (the mass spectrometer output) to the end result: identification and quantification (see Figure 1). The quality of each step in the pipeline affects the sensitivity and reliability of the outcome . There are many optional steps, some of them very popular. We will describe the essential and some optional steps.
All MS experimental data share a set of descriptive keywords that are essential for referencing components of the output map (see Figure 2). A comprehensive reference of key MS terms is provided in .
The details of sample preparation are beyond the scope of this paper. However, at a general level, sample preparation strategies prior to mass spectral analysis are based on isolating analytes of interest and removing all other contaminating molecules. For instance, filters can be used to separate high molecular weight proteins from low molecular weight lipids and metabolites, or contaminates. Other sample preparation techniques exploit analyte hydrophobicity, charge, and analyte-specific affinity. The degree of specificity in sample preparation is determined by the end goal of the experiment . For example, if an experiment requires the analysis of only phosphorylated proteins, the sample preparation should isolate only phosphorylated proteins. Of course, this is very challenging but using an appropriate sample preparation strategy specific to an experimental need significantly simplifies mass detection and data analysis and in some cases is required to identify analytes of interest. Proteomics, lipidomics, and metabolomics each have unique considerations in sample preparation.
Direct injection refers to infusing the sample directly into the mass detector. This is usually done with some sort of machine to make the flow constant.
While it is sometimes advantageous to allow all analytes to flow through detection at once, most MS experiments of complex samples will use chromatography due to its ability to spread out analytes over time, making it less likely that the ionization capacity will be overcome by large quantities of analyte or background ions, a phenomenon called ion suppression.
Chromatography disperses the introduction of analytes into the mass detector through time based on some chemico-physico property (hydrophobicity, for instance). All chromatography systems have two phases: the stationary phase and the mobile phase. The stationary phase causes analyte separation and the mobile phase carries the analytes through the chromatographic column to the mass spectrometer. Methods include:
LC-MS - mass spectrometry coupled to liquid chromatography. Liquid chromatography uses a liquid mobile phase and a column packed with chemically derivated beads as a stationary phase. The mobile phase is composed of a two-liquid gradient. Changes in the gradient (the percent composition of each liquid) cause analytes to be slowly released from the column and enter the mass spectrometer. Different stationary phases can separate analytes based on hydrophobicity, charge, size, or affinity. However, the most common stationary phases for LC-MS on biomolecules are reversed phase (hydrophobic) and strong cation (charge) .
GC-MS - mass spectrometry coupled to gas chromatography. In gas chromatography systems the mobile phase is an inert gas (such as helium) and the stationary phase is a column designed to separate molecules based on polarity. The gradient is temperature increase; molecules with a high affinity for the column elute at higher temperatures.
CE-MS - mass spectrometry coupled to capillary electrophoresis. Electrophoresis differs from chromatography, relying on electric fields, rather than mobile and stationary phases, to separate molecules . Capillary electrophoresis uses an electric field applied to long narrow capillaries to separate molecules based on size, charge, and flow resistance through the capillary.
Multidimensional chromatography (sometimes referred to as tandem chromatography) refers to two chromatographic systems applied to the same system. In the case of LC-GC-MS, for example, analytes are introduced into the gas chromatography system as they elute from the LC system, with each system causing analytes with specific properties to elute with precedence. A more common multidimensional system in MS-omics is MUDPIT. MUDPIT uses two orthogonal separation strategies like strong cation ion exchange (charge based) and reversed phase (hydrophobicity based) chromatography to achieve greater resolution.
Analytes must be ionized (i.e., in a charged state) in order to be detected by the mass spectrometer. Electrospray ionization (ESI) was developed in 1994 and is the most popular in MS-omics due largely to its ability to ionize unstable molecules without breaking chemical bonds and to the diverse range of analytes that can be ionized by the method [15, 16]. Other methods include atmospheric pressure chemical ionization (APCI) , matrix-assisted laser/desorption ionization (MALDI) , and electron-ionization (EI) . Ionization methods for ms-omics are generally referred to as soft ionization methods and include ESI and MALDI. EI is a harsh ionization method and will destroy most biomolecules except for very stable lipids and metabolites.
As charged particles are passed through the mass spectrometer, the mass-to-charge ratio (m/z) of detected particles is registered. A single scan on the resulting output represents a snapshot of the precursor ions passing through the mass spectrometer at that particular retention time (RT). The ions in this stage are called precursor ions because in tandem mass spectrometry (MS/MS), ions in small m/z windows are captured for fragmentation and MS detection a second time, yielding a second set of ions called product ions that can be used to identify precursor ions by matching their MS/MS patterns to a database of possibilities. It is important to understand that the ratio of solution selected for MS/MS fragmentation is low, normally capturing only 10-20% of the precursor (MS1) data. Because most MS/MS systems autoselect what segments to capture based on intensity, much of that portion overlaps between replicates. Of that 10-20%, less than 60% are identified via database lookup, and even that is subject to false positive identifications .
An analyte can contain certain naturally occurring rare isotopes, such as carbon-13. These isotopes tend to occur in individual analytes in known quantities, causing a characteristic pattern called an isotopic envelope (see Figure 2). The envelope is characterized by the number of and relative intensity between its isotopes. The monoisotopic peak, or peak that appears at the theoretical mass discounting any attached heavy isotopes, usually appears alongside the slightly heavier masses of any portion of the peptide or lipid in the sample that contains heavy isotopes.
When an analyte exists in a run in more than one charge state (a very common occurrence due to variability in ionization), its isotopic envelope will reappear in a compressed and shifted form due to increased charge, as illustrated in Figure 3. The equation for the shift is specific to the source of the charge. For instance, a charge can be induced by the addition of a proton, in which case the shift is defined by (µ + k)/charge m/z with a gap between ions in the isotopic envelope of 1/k, where k is the charge of the analyte (3+, 2+, 1+, and 1+, respectively in Figure 3) and µ is the m/z of the single-charged analyte (this is the analyte with only a +1 charge--399 in Figure 3).
Mass spectrometers output raw data--a large collection of data points each consisting of a tuple of m/z, intensity, and time (RT) either in profile or centroid form. Profile data contains all data points registered by the mass spectrometer (see Figure 4a), while centroid data has been reduced to data points that represent the local maxima in a single spectrum, a distribution of data over an m/z range for a given RT (see Figure 4b). Centroid data is much more concise than profile data, but the reduction incurs information loss.
Experiments can run in full scan mode--where the full range of m/z values is read--or the mass spectrometer can scan only certain m/z values (called single reaction monitoring in the case of one m/z value or multiple reaction mode in the case of several) .
Mass spectrometers have varying characteristics depending on the mechanisms used for mass detection, each with a different resolution. Resolution at a certain m/z is given by the ratio of that m/z to the smallest m/z gap between two distinguishable ions. Higher resolution instruments yield narrower profile peaks (see Figure 4a), allowing the signals from two distinct ions to be distinguished despite their similarity in m/z.
Data processing consists of each of the possible steps in the MS-omics pipeline (Figure 1) involving digital manipulation of the mass spectrometer data or products from that data. These methods are constantly being improved upon and are discussed in detail in Section. Here, we provide a high-level overview of the role of data processing in the MS-omics pipeline.
The first step in data processing is handling the raw data produced by the mass spectrometer. Algorithms for noise reduction, feature detection, and correspondence exist that operate on the raw data. However, many require preliminary conversion out of the proprietary data format of the instrument and into an open data type (see below for a discussion of existing data types). It is important to note that, due to the size of the data sets, random access data processing--where only a portion of the data file is loaded into memory at a time--is a must, although some current tools load the full file and are therefore prone to crashing and subject to file size limits as memory is exhausted.
Prior to analyte identification, the data must be denoised, peak-picked, featuredetected, deisotoped, and deconvoluted. These are significant and open problems and are discussed in more detail below.
Analyte identification follows data processing. Here, one of several available databases are used to compare the experimental feature observations (i.e. isotopic envelopes, isotopic traces, etc.) to theoretical patterns. These include Sequest  for proteins, LIPIDMAPS  for lipids, and METLIN  for metabolites. Due to incomplete/growing databases and noisy data, closest-match assignment is prone to false positives and mismatches. Statistical analysis is almost always incorporated in this or prior steps in order to ascertain the significance of the identification.
The ultimate goal of data processing is to yield the quantity of each analyte. The identification and quantity of analytes, as well as the underlying raw data, must be stored in data structures that allow for efficient access and manipulation of the data.
Raw data is a general label that actually describes a set of data formats specific to the vendor of the instrument. Many data converters from raw to open data formats exist. One popular converter is pwiz (http://proteowizard.sourceforge.net/). The Network Common Data Form (NetCDF), a generic open science data format, is an early data format that is still in use in some applications. mzXML is an open XML based data format with wide support. mzML was developed to replace mzXML and has more information from the raw data encoded and uses extensible ontologies to encode meta-data. mzQuantML is an open data format specifically intended for the storage of quantities associated with identified feature data. mzIdentML and pepXML are standards designed to facilitate database identity searches. Annotated Putative Peptide Markup Language (APML) is an XML standard designed to provide a single data file encoding of the original data set and its modifications via data processing tools .
Lack of labeled data
The prevailing problem in developing and evaluating computational approaches to MS-omics problems is the lack of labeled data . Labeled data is difficult to obtain both because of the size of data sets--which can easily consist of millions of data points per file and hundreds of GBs of files for a replicate experiment series--and the undependability of hand-labeling--which is both time consuming and subjective. Several approaches for mitigating this problem exist: qualitative metrics, spiked mixtures, and in silico simulated data.
Qualitative metrics Evaluation metrics that do not use ground truth avoid the need for labeled data. For example, replicate alignment quality can be assessed via the Pearson correlation coefficient, feature overlap rate, or coefficient of variation. This approach is sub-optimal, as a good score on a qualitative metric does not necessarily translate into a good quantitative score using labeled data, but it is easy to compute and is comparable across problem instances.
Spiked mixtures Commercially available purified and quantified measures of a specific analyte are combined to produce a data set with known composition and quantity. These samples are not exactly ground truth, however. Due to ionization inefficiencies, environmental contaminants, and the variability of mass spectrometry, no instrument will report the same quantity and composition predicted by a spiked mixture. What's more, a mixture of a few analytes, which often do not co-occur in nature, is hardly representative of real-world scenarios, in which complex samples can easily contain hundreds of thousands of distinct analytes. To create more realistic conditions, spiked mixtures can be added to samples where the spiked an-alytes are not expected to occur. However, a method's accuracy on a few analytes is not necessarily indicative of performance across all analytes, particularly given the variability and limitations of MS/MS, which is commonly used to single out the m/z of the expected analytes but cannot be expected to capture the gross majority (≈ 80 − 90%) of the remaining sample.
In silico simulated data Simulated data is used in the field to refer to real-world data sets that have been purtubed with m/z shifts or intensity value modifications in order to create psuedo-new data without having to rerun costly experiments. True simulated data, called in silico to identify that it as purely sourced from simulation algorithms on a computer, is a relatively new advent in MS-omics. Creating realistic in silico data requires the analysis of many ground truth datasets, which creates a chicken and egg problem, as the difficulty of obtaining ground truth datasets is the very reason an in silico simulator would be beneficial.
Sources of open data
To facilitate strictly algorithmic advances in MS-omics, to avoid the need for a costly wet lab for creating mass spectrometry data, and to aid in evaluative comparisons against existing methods, more and more practitioners are making their data freely available online. Although any serious foray into MS-omics should certainly include a collaborator with mass spectrometry assets and formal training, we present a list of some of these open data sets in order to aid those who are interested in investigating MS-omics for the first time as well as more seasoned investigators who would simply like to make a case for the generality of their methods.
Lange et al. have provided two proteomic and two metabolomic data sets  which they have used to assess the quality of several alignment algorithms at http://msbi.ipb-halle.de/msbi/caap. The data is already segmented into reduced isotopic envelopes (isotopic envelopes whose isotopic traces are integrated into a single point).
Listgarten et al. provide centroided replicate data with spiked-in peptides . There are two data sets: a set of 11 replicate LC-MS runs from ruptured E. Coli cells and a set of 14 LC-MS runs of human serum samples.
The SuperHirn data set  can be found at http://proteomics.ethz.ch/muellelu/web/Latin_Square_Data.php. It consists of 18 LC-MS runs from tryptic digests of 6 nonhuman proteins spiked with different concentrations into a complex human peptide sample and includes the raw as well as processed data. The data was obtained on an FT-LTQ.
Problems of interest
Among the data processing portion of the MS-omics pipeline, some problems are widely studied, and some are emerging. All provide future research potential.
In silico simulation
The lack of ground truth data for evaluation of data processing algorithms precludes effective validation and comparison. In silico data simulation is a relatively new approach to providing on demand ground truth simulated data. By modeling a list of analytes and a description of experimental conditions, simulators can provide estimates of mass spectrometer output combined with labels of the analytes and quantities used in silico to generate the data (see [28–31]).
Correcting mass shift
Analyte detection on the m/z axis in mass spectrometers is subject to two types of error: systematic mass error--a functional deviation from true mass--and random mass error . Typically, systematic mass error is mitigated by routine machine recalibration--a process wherein analytes of known mass are processed in the mass spectrometer to create a model that is used to interpolate m/z shift for any given m/z value. However, the efficacy of this calibration reduces over time as the mass constantly continues to shift. Additionally, some machines benefit from an injection of spiked standards during a normal experiment for internal calibration, which helps overcome the temporal effects of space charge effects, electric fields, peak intensity, and temperature . Internal standards are undesirable due to the additional cost of standards and the suppression implications of spiked standards. Computational mass calibration techniques have been proposed in order to provide the mass accuracy of internal calibration but with better consistency and lower cost . This is an active but not crowded area of research with practical implications.
Correspondence, the registration of recurring signals from the same analyte over replicate samples, is a crucial problem in any of the many MS experiments where multiple runs of similar samples are compared to each other (see Figure 5). For a comprehensive review of current algorithms, see . Persisting problems are an abundance of user parameters, models that do not include known behavior, prohibitively long runtimes, and a lack of performance comparison between methods .
MS-omics produces inherently noisy data. Noise can consist of spurious data points or distortion of a data point's true value in retention time, m/z, or intensity. Denoising as used in MS-omics refers to the removal of spurious data points. Baseline subtraction is a common method in which signals with intensity lower than an adaptive threshold are considered to be noise and removed (see Figure 6). This is an active area of research, though most experiments in the literature have not made an explicit and dedicated study of different techniques, instead describing the denoising method applied as a data processing step in a larger experiment.
The most important step of an MS-omics workflow is undoubtedly feature detection , a general term that can apply to the extraction of various signal elements from MS data. In chromatographic data, feature detection can refer to either extracting isotopic envelopes or isotopic traces from an MS sample output (see Figure 7). Many methods exist for isotope trace extraction, among them a promising new algorithm that performs well on existing evaluations . Sometimes this process is called peak picking or peak detection, but those terms should be avoided since they are also used to refer to the conversion from profile data to centroid data. In direct injection data, feature detection is sometimes referred to as peak summarization, since each spectra (being an approximation of the latent content of the non-chromatographically separated sample) must be combined into a TIS through mitigating the variance inherent in m/z across spectra (see ).
As described earlier, many data types exist for MS-omics data. New data formats continue to be proposed to meet unforeseen needs.
A recent prevailing expansion point has been the need to store the results of data processing tools in addition to the original data. Truly modular pipelines require data structures that contain all necessary data to be used by any tool in the pipeline, meaning previous modifications are annotated in addition to retention of the original data. APML is one attempted solution to this problem, but, so far, the community has not embraced it, as it appears that there are only two extant algorithms which use it .
There is still a need for compact, random access, and information rich data structures and access for MS data . What's more, some proprietary formats can still only be converted to open formats on Windows platforms.
As discussed earlier, mass spectral identifications may be based on several factors, but two inputs, the precursor mass (the mass of the molecule) and the fragmentation pattern (through MS/MS) of the precursor mass, are by far the most common identifiers. This spectral information provides a fingerprint unique to most biological molecules; however, low quality spectra cause false positives and false negatives. While improving mass spectrometry will certainly improve spectral quality, improving spectral search algorithms and employing new identification inputs will allow for more confident identifications. This is particularly true for the relatively new fields of metabolomics and lipidomics.
Retention time refers to the amount of time an analyte is delayed by chromatography before exiting and being detected by the mass spectrometer. Retention time is correlated with physical and chemical analyte characteristics; therefore, predicting analyte retention time provides another factor for positive identification. Many peptide retention time prediction strategies exist . However, cross instrument retention times vary greatly due to changes in experimental parameters, creating a real need for retention time normalization as well as retention time prediction.
Mass variance correction
Mass variance, the difference between the theoretical and experimental (observed) mass of analytes is an open problem. One way of correcting mass variance is by using the weights of the elements of each analyte to predict m/z locations where a lack of signal is impossible, allowing for the identification of systematic deviation from theoretical masses in a sample . A similar approach is to model such theoretical gaps via a sine curve fitted via a fast Fourier transform . Accurate m/z values are essential to analyte identification.
According to a recent survey of the field, the biggest problem in lipidomics is the need for a standardization of data acquisition and data processing, due to the huge variability in instruments, protocol and data processing for lipidomics. The many options and permutations in the MS pipeline would make for a very long methods section if explicitly described in a paper--much too long for any journal's page limits. Although several partial ontologies exist (see [41, 42]), there is no concise way to uniquely identify an experiment from start to finish, including sample preparation, mass spectrometry protocol, and post-processing. Existing ontologies are particularly lacking in terms of data processing terms. TODO cite clarity in concepts
MS signal intensity is related to but not equivalent to analyte quantity [43, 44]. Factors that influence this discrepancy include :
Ionization efficiency. Not all analytes in a sample are ionized.
Enzyme digestion rate. When an enzyme--such as trypsin--is used to digest proteins into peptides, not all proteins are completely cleaved. This leads to less-than-expected signal abundance, as the true abundance will be diminished by whole proteins (which are not ionized and therefore not detected), and incompletely digested proteins (which will be detected at different m/z than the expected peptide components).
Ion suppression. When the quantity of analyte entering the ionization mechanism at a given time exceeds the ionization capacity of the ionization mechanism, only a portion of the analyte is charged .
Accurate models of these effects would improve estimates of analyte population in samples, as well as further advance in silico simulation.
Currently, quantification methods generally fall into one of three approaches: label free spectral counting, quantification via differential stable isotopes, and label free quantification based on the precursor ion signal intensities . Spectral counting is a method in which peptide signals are used to create a protein tally--the count of every protein containing a certain peptide is incremented each time one of its peptides is identified via MS/MS. Despite its prevalence, the accuracy of spectral counting is limited by its dependence on MS/MS acquisition rates, which, as mentioned above, are very low, and its propensity for false positives, since all proteins containing each detected peptide are considered as present when in reality only one need be. Stable isotope labeling methods (SILAC, ICAT, iTRAQ, and TMT) also have significant limitations (see ). Besides cost and sample prep complications, nearly all methods increase the number of co-eluting analytes, creating a bottleneck for the complexity of samples handled. What's more, because stable isotope methods target a small specific list of analytes a priori, they are not practical in terms of time and money for data-driven discovery, where sample composition is unknown .
Modeling dynamic range suppression effect
Dynamic range is a term that describes the minimum intensity of a detectable signal given a co-eluting analyte of a higher intensity (see Figure 8). All mass spectrometers have a dynamic range limitation. The current state of the art is 103 - 104, meaning that at a given RT if one analyte has an intensity of 1.3 × 105, any analyte with an intensity less than 1.3 × 102 would not be detected.
Fragment ion intensities
Because MS/MS acquisition captures not just the analyte of interest but also any surrounding precursor ions, and because fragmentation isn't a perfect process, fragment ion intensities are not as accurate as desired [50, 4]. Several machine learning approaches have been proposed for making more accurate fragment identifications [51, 52]. However, this is still an open problem.
De novo peptide sequencing
De novo sequencing is an alternative method to database matching that accommodates peptides that don't match up with the database (caused by mutations, polymorphisms, modified amino acids or simply a missing database entry) . Here, the original peptide sequence--defined by a series of letters, each representing an amino acid--is reconstructed based on the MS/MS fingerprint and the chemical properties of the analytes. A recent tutorial provides more detail and resources .
Fragmentation patterns for lipids
Proteins have a known cleavage pattern, meaning that when peptides are fragmented by MS/MS, association to a peptide is straightforward. Lipids, on the other hand, have a much more complex form due to a wider vocabulary of building blocks and a more complicated fragmentation pattern. To date, no fragmentation rules have been published, making MS/MS much less helpful in lipidomics than proteomics. Because of the complexity of lipids, a machine learning approach could be appropriate in finding a solution to this problem.
Biomarker discovery is the use of comparative analysis (see Figure 9) in order to identify analytes that correlate with certain diseases or other conditions for diagnostics or drug development. It is an active area of research with a lot of published work; however the problem is still wide open due to limitations in mass spectrometry, preprocessing, and identification. Current methods struggle to highlight case/control differences in complex samples, requiring painstaking, time consuming, and error-prone manual detection.
Deisotoping is the process of reducing several instances of the same analyte at different charge states into a single feature--usually a monoisotopic peak (see Figure 3). This is necessary because the query to a data base search consists of only the single-charged feature m/z and (optionally) RT. Adding to the complexity of registering differently charged versions of the same analyte is the fact that, in complex samples, the isotopic envelopes of different analytes can and do overlap, requiring deconvolution (see below).
Overlapping signals must be resolved prior to quantification (see Figure 10). RT overlaps occur when two isobaric analyte elute without a gap between them, and are more common in complex samples. Isotopic envelope overlaps occur in m/z where two analyte are not sufficiently separate in m/z at their current charge state. Ion overlaps occur when particular ions of two given analyte are too similar to be resolved in m/z. All m/z overlaps are less likely in high resolution machines, which by definition are capable of better resolving power evinced by more narrow signals in m/z. RT overlaps can be minimized to some extent by sample preparation and protocol designed to separate similar molecules into different RT areas.
In general, most algorithms require the user to optimize a host of parameters through manual tuning, which is time intensive. New algorithms should avoid free parameters. If included, they should also provide guidance or an automated method to fix them. Research opportunities include developing methods for automatically optimizing parameters on existing and popular methods.
MS-omics is an exciting, developing field with many research opportunities for mathematicians, computer scientists, and statitisticians. Although contribution to the field requires a functional understanding of many domain-specific concepts and terms, the open nature of most of the existing problems provides many opportunities for impact.
RS is supported by the NSF graduate research fellowship (DGE-0750759).
Hemminger BM, Losi T, Bauers A: Survey of bioinformatics programs in the United States. Journal of the American Society for Information Science and Technology. 2005, 56 (5): 529-537. 10.1002/asi.20141.
Fahy E, Subramaniam S, Murphy RC, Nishijima M, Raetz CR, Shimizu T, Spener F, van Meer G, Wakelam MJ, Dennis EA: Update of the LIPID MAPS comprehensive classification system for lipids. J Lipid Res. 2009, 50 (Supplement): 9-14. 10.1194/jlr.R800095-JLR200.
Kraegen EW, Cooney GJ, Ye JM, Thompson AL, Furler SM: The Role of Lipids in the Pathogenesis of Muscle Insulin Resistance and Beta Cell Faiture in Type II Diabetes and Obesity. Experimental and Clinical Endocrinology & Diabetes. 2001, 109 (Suppl 2): 189-201. 10.1055/s-2001-18581.
Morris M, Watkins SM: Focused metabolomic profiling in the drug development process: advances from lipid profiling. Current Opinion in Chemical Biology. 2005, 9 (4): 407-412. 10.1016/j.cbpa.2005.06.002.
Dakna M, He Z, Yu WC, Mischak H, Kolch W: Technical, bioinformatical and statistical aspects of liquid chromatography/mass spectrometry (LC-MS) and capillary electrophoresis-mass spectrometry (CE-MS) based clinical proteomics: A critical assessment. Journal of Chromatography B. 2009, 877 (13): 1250-1258. 10.1016/j.jchromb.2008.10.048.
Han X, Gross RW: Electrospray Ionization Mass Spectroscopic Analysis of Human Erythrocyte Plasma Membrane Phospholipids. Proceedings of the National Academy of Sciences of the United States of America. 1994, 91 (22): 10635-10639. 10.1073/pnas.91.22.10635.
Michalski A, Cox J, Mann M: More than 100,000 Detectable Peptide Species Elute in Single Shotgun Proteomics Runs but the Marjority is Inaccessible to Data-Dependent LC-MS/MS. Journal of Proteome Research. 2011, 10: 1785-1793. 10.1021/pr101060v.
Eng JK, McCormack AL, Yates JR: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry. 1994, 5 (11): 976-989. 10.1016/1044-0305(94)80016-2.
Smith CA, O'Maille G, Want EJ, Qin C, Trauger SA, Brandon TR, Custodio DE, Abagyan R, Siuzdak G: METLIN: a metabolite mass spectral database. Therapeutic Drug Monitoring. 2005, 27 (6): 747-751. 10.1097/01.ftd.0000179845.53213.39.
Smith R, Prince JT: JAMSS: Proteomics mass spectrometry simulation in java. Bioinformatics (in review).
Noyce AB, Smith R, Dalgliesh J, Taylor RM, Erb KC, Okuda N, Prince JT: Mspire-Simulator: LC-MS Shotgun Proteomic Simulator for Creating Realistic Gold Standard Data. Journal of Proteome Research. 2013, 10.1021/pr400727e.
Schulz-Trieglaff O, Pfeifer N, Gröpl C, Kohlbacher O, Reinert K: LC-MSsim-a simulation software for liquid chromatography mass spectrometry data. BMC Bioinformatics. 2008, 9 (1): 423-10.1186/1471-2105-9-423.
Egertson JD, Eng JK, Bereman MS, Hsieh EJ, Merrihew GE, MacCoss MJ: De Novo Correction of Mass Measurement Error in Low Resolution Tandem MS Spectra for Shotgun Proteomics. Journal of The American Society for Mass Spectrometry. 2012, 1-8.
Sugimoto M, Kawakami M, Robert M, Soga T, Tomita M: Bioinformatics Tools for Mass Spectroscopy-Based Metabolomic Data Processing and Analysis. Current Bioinformatics. 2012, 7 (1): 96-10.2174/157489312799304431.
Taylor CF, Hermjakob H, Julian RK, Garavelli JS, Aebersold R, Apweiler R: The work of the human proteome organisation's proteomics standards initiative (HUPO PSI). Omics: a Journal of Integrative Biology. 2006, 10 (2): 145-151. 10.1089/omi.2006.10.145.
Whetzel PL, Parkinson H, Causton HC, Fan L, Fostel J, Fragoso G, Game L, Heiskanen M, Morrison N, Rocca-Serra P: The MGED Ontology: a resource for semantics-based description of microarray experiments. Bioinformatics. 2006, 22 (7): 866-873. 10.1093/bioinformatics/btl005.
Li XJ, Zhang H, Ranish JA, Aebersold R: Automated statistical analysis of protein abundance ratios from data generated by stable-isotope dilution and tandem mass spectrometry. Analytical Chemistry. 2003, 75 (23): 6648-6657. 10.1021/ac034633i.
Mueller LN, Brusniak MY, Mani DR, Aebersold R: An Assessment of Software Solutions for the Analysis of Mass Spectrometry Based Quantitative Proteomics Data. Journal of Proteome Research. 2008, 7 (1): 51-61. 10.1021/pr700758r.
Wang W, Zhou H, Lin H, Roy S, Shaler TA, Hill LR, Norton S, Kumar P, Anderle M, Becker CH: Quantification of proteins and metabolites by mass spectrometry without isotopic labeling or spiked standards. Analytical Chemistry. 2003, 75 (18): 4818-4826. 10.1021/ac026468x.
Elias JE, Gibbons FD, King OD, Roth FP, Gygi SP: Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nature Biotechnology. 2004, 22 (2): 214-219. 10.1038/nbt930.
The authors declare that they have no competing interests.
RS, ADM, DV, and JTP all contributed in writing this manuscript.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Smith, R., Mathis, A.D., Ventura, D. et al. Proteomics, lipidomics, metabolomics: a mass spectrometry tutorial from a computer scientist's point of view.
(Suppl 7), S9 (2014). https://doi.org/10.1186/1471-2105-15-S7-S9