GC-MS is a robust and sensitive platform for the profiling of certain metabolite classes [1, 4, 12, 13, 15]. Due to experimental limitations inherent in all chromatographic separations, slight drifts are observed in GC-MS elution times between experiments [17, 18]. These drifts are particularly problematic for non-targeted metabolic profiling studies, which aim to analyze all detectable analytes within multiple experiments [1, 12, 13, 19, 22, 23]. There are two schools of thought regarding how to address the retention time correction in hyphenated mass spectrometry (Figure 1). Algorithms that rely on time domain data only (Figure 1, branches 1 and 3) ignore highly pertinent information contained in the m/z dimension. Accumulated evidence suggests that effective retention time correction cannot be achieved based on retention time only, regardless of whether the approach involves peak matching or profile alignment [17–19, 22, 26, 28, 29].
Profile alignment approaches that use the full chromatogram data matrices are expected to be the most accurate (Figure 1, branch 4). These approaches, however, come at a high cost, both in terms of complexity and computational costs. For example, Bylund and co-authors used correlation optimised warping , while Baran and co-authors used dynamic time warping with explicitly specified time shifts . In both cases, an arbitrarily chosen "target" chromatogram was used to align all other chromatograms in a pairwise fashion, with the chromatogram data matrices segmented to facilitate the alignment [27, 30]. This raises several difficulties. For example, it is unclear how the choice of the target chromatogram may affect the final alignment. In principle, a more objective alignment of chromatogram data matrices could be achieved by calculating a similarity tree [32, 33]. This, however, must involve all possible pairwise alignments, which is likely to be computationally expensive. Furthermore, in the reported examples the chromatogram segmenting was based on strategic or node positions influenced by user chosen parameters , or a "representative" set of peaks . This in turn raises the question of how the segmenting method might affect the final alignment. Finally, these approaches handle a large amount of uninformative noise data, since only a small portion of the full chromatogram data matrix is informative signal.
Peak matching algorithms that use both time domain data and mass spectra are particularly promising, and have been applied to both LC-MS [20, 21] and more recently to GC-MS data . The input for these algorithms are signal peaks extracted from full chromatogram data matrices. Since the majority of uninformative data is discarded in the peak detection step, these methods operate on a vastly reduced data set while retaining highly selective information contained in the mass spectra. Here we propose an approach that falls into this category, and uses signal peak "objects" which are signals extracted from the chromatogram data matrix. A peak object may be characterized with several attributes including peak retention time (the time taken at the peak apex), peak mass spectrum (the m/z vector taken at the peak apex), experiment/cell state information, a unique peak ID, and so on. The result of an experiment is a list of peak objects, henceforth referred to as a peak list. From this viewpoint the peak matching problem is reduced to the alignment of peaks between multiple peak lists. If we assume that the elution order of peaks is conserved (a reasonable assumption for GC-MS; also an assumption widely used in proteomics based LC-MS ), this problem shows resemblance to extensively studied problem of multiple sequence alignment [32, 33]. In order to cope with rapidly escalating computational costs of an exact, multidimensional dynamic programming solution, efficient algorithms were developed for multiple sequence alignment [33, 35]. We have adapted this approach to the problem of peak alignment in multiple GC-MS experiments.
The method proposed here uses both peak retention times and mass spectra, and relies on dynamic programming to find the optimal solution to the global alignment problem. Any peak matching method that uses both similarity in retention times and mass spectra similarity depends on a balance between the two, and devising an approach that balances this correctly poses a considerable challenge. For example, it is possible to incorporate mass spectra similarity into the total peak similarity score used in progressive hierarchical clustering , but it is unclear how to weight the relative contributions of retention time and mass spectra. The problem here is that progressive clustering relies on a single cutoff of the dendrogram tree to delineate peak clusters (essentially, the cutoff is a constant across the entire data set), while the retention times drifts are highly non-linear (Figure 2 and ). In the work of Styczynski et al, the peak matching approach used to find conserved metabolites in GC-MS metabolic profiling experiments incorporated both the similarities in mass spectra and peak retention times . However, in the demonstration of this approach the retention time similarity was taken into account only coarsely, with an elution similarity threshold of 1 min . When two metabolites have distinct mass spectra, the retention time information can be neglected altogether and correct peak matching can still be achieved. However, for metabolites that elute in close proximity to one another, and give similar fragmentation patterns, the information provided by retention times is critical for a correct matching. Therefore, we would expect the method of Styczynski and co-authors to have difficulty when metabolites with similar fragmentation patterns elute in close proximity, as demonstrated by their inability to resolve isoleucine and leucine in the test data set .
We propose the peak similarity function that incorporates the similarity between peak mass spectra modulated by the similarity in peak retention times (Equation ). This function is governed by two parameters: the gap penalty function (G) and the retention time tolerance (D). The first parameter (G) determines how similar mass spectra must be for peaks to be considered to represent the same metabolite; the second parameter (D) is related to expected drifts in retention times between experiments. For example, if retention times in a particular set of experiments are highly reproducible, decreasing the retention time tolerance will enable this information to be leveraged for increased accuracy in the peak alignment.
To achieve the alignment of multiple peak lists, we rely on progressive alignment based on a similarity tree. The similarity tree is calculated from pairwise alignments, and the global alignment is built progressively, starting from the two most similar peak lists, and joining other peak lists in a process guided by the similarity tree [32, 33]. When several cell states with multiple replicate experiments per state are analyzed, within each cell state one deals with true experimental replicates, while in experiments performed on different cell states some metabolites may be missing altogether in one state relative to another. To accommodate for this complexity, we first perform a within-state alignment, followed by the between-state alignment built by aligning the within-state alignments. In the case of more than two sets of replicate experiments, a similarity tree is built based on pairwise similarities between fixed within-state alignments, and then the alignments themselves are aligned progressively following the similarity tree. Therefore multiple guide trees are built, one for each set of replicate experiments to facilitate within-state alignment. An additional guide tree may be built to facilitate between-state alignment if more than two states are present.
Several methods for peak matching based on both retention times and mass spectra (Figure 1, branch 2) have been described recently [20, 21, 23]. Of these only the approach of Styczynski and co-authors was developed on GC-MS data , while the software packages MZmine  and XCMS  were developed on LC-MS data. In our experience the latter software packages tend to overinterpret GC-MS data, assigning a greater number of peaks than expected (probably related to differences in fragmentation patterns between GC-MS and LC-MS). Nevertheless, the alignment algorithms implemented in MZmine and XCMS are relevant and we briefly review them here.
MZmine uses a simple alignment method to build the peak alignment table: one peak at a time is taken and an attempt is made to match it to an existing row of the peak alignment table. If no rows match a new row is created . A secondary peak detection method is used to fill the gaps in the resulting alignment table . We expect this approach to exhibit limitations similar to those observed in hierarchical clustering , and discussed above.
XCMS incorporates one of the most advanced peak matching algorithms for metabolite profiling data described to date . In this approach the distribution of peaks along the time domain by using the kernel density estimator is calculated, and regions where many peaks have similar retention times are identified . From this, a fixed time interval that determines each group of peaks is deduced . In practice, the retention time drifts are distributed in a highly irregular fashion, and a fixed time interval is unlikely to be able to capture peak groups correctly. This is suggested by observed peak collisions, where more than one peak from the same experiment is joined into a single group . The same problem was encountered in other peak matching methods [19, 22], and requires additional, empirical intervention, such as "collision resolution"  or "tie-breaking" in XCMS . The approach we propose inherently prevents peak collisions. Furthermore, it does not rely on any fixed intervals to group peaks, nor does it rely on any assumptions about the distribution of peaks across the samples.
It is of interest to compare the method proposed here to the approach for constructing signal maps, described recently in LC-MS proteomics experiments . Underlying the calculation of signal maps is the method for optimal alignment of chromatogram data matrices, that can be classified as belonging to branch 4 in Figure 1. These authors used the Needleman-Wunsch algorithm to align full chromatogram data matrices obtained from LC-MS proteomics experiments . They also used minimum spanning tree, or progressive merging of pairwise alignments into a consensus run, to produce a global signal map . Apart from the overt difference in that we focus on extracted signals rather than using the full chromatogram data matrices, there are several important yet more subtle differences between the two methods. First, in contrast to the method of  the score function proposed here incorporates both similarities in retention times and mass spectra. Second, in constructing signal maps from peptide mass spectra Prakash and co-workers allow for multiple mass spectra in one experiment to correspond to a single mass spectrum in the other experiment . Since we are aligning peak objects rather than the raw signal, our approach inherently allows only one-to-one peak matching, and we consider explicitly the question of gaps (a match-to-nothing). Both of these features are critical for the ability to achieve the alignment of signal peaks in GC-MS. Similarly as in the approach of , we rely on the two-dimensional formulation of the global sequence alignment problem. However, during progressive alignment of individual experiments we do not merge individual runs into a "consensus" run, because this has a potential to degrade signal if two unrelated signal peaks are merged. Rather, our approach is based on the generalized solution for the alignment of two alignments, i.e. an N- and M-alignment. This has an additional benefit to be directly extensible to the alignment of pre-computed alignments representing replicates of different cell states (where signal-to-nothing matches may be significant), and allows one to tackle the problem of an arbitrary number of cell states with the same conceptual framework and finite computational resources. We note that the modification proposed for the calculation of mass spectra similarity , and results from other proteomics studies , could be used in the method proposed here.
It would be useful to compare more directly the performance and accuracy of the approach proposed here to the approaches for peak matching described previously. Unfortunately, this is currently not feasible for two reasons. Firstly, a standard data set which could be used as a benchmark does not exist. Furthermore, it is difficult to create such a data set ad hoc. Most other alignment methods of interest, such as methods implemented in software packages MZmine  or XCMS , are embedded in the multi-step processing pipelines, and the input to peak matching is generated directly from the output of package-specific peak detection. Secondly, a suitable metric to quantify differences between two methods does not exist. This is particularly problematic, because in even the simplest case the resulting alignment table may involve hundreds of peak entries, and it is not clear how to represent and quantify the differences between two such tables.
The development of a suitable metric to compare two peak matching methods is likely to require a separate research effort. To circumvent this we have performed a detailed analysis of absolute error and compiled error statistics relative to the "correct" alignment table created manually (Figures 6 and 7). The error analysis performed on experimental data consisting of eight replicate GC-MS experiments collected on Leishmania parasites with ~170 peaks per experiment showed that, for near optimal parameters, <5 metabolites were affected by misalignment errors. This approaches the accuracy achieved in a manual alignment, which required tens of man-hours for the tested data set.
The alignment accuracy was not overly sensitive to empirical parameters for a wide range of values (gap penalty and retention time tolerance), suggesting that the method is robust. To address the problem of a benchmark data set, we provide our test data set in the supplementary material (see Additional files 1, 2, 3). This includes input peak lists, raw chromatogram data matrices, and alignment tables, including those produced for optimal parameters, as well as the "correct" alignment table prepared manually.
The main drawback of the proposed approach is that it operates on a set of signal peaks obtained from peak detection pre-processing. Automated peak detection remains a challenge , and any errors introduced during peak detection (such as missing peaks or false peaks) will propagate through to the alignment tables. This is inherent in all peak matching methods, and could be viewed as an advantage as well. Focusing on signal peaks results in a greatly reduced data set which contains the vast majority of interesting signal, and allows one to leverage this information for downstream processing , in addition to significantly reducing computational costs required for downstream processing.