Volatile sample collection and analysis
Volatile compound sampling
Volatile compound sampling protocols (sorbent choice and sampling method) are specific to analyte identity and sample source, and vary widely depending on the research area and focus. The majority of our sampling has employed the polydimethylsiloxane (PDMS)-based Twister™ (GERSTEL, Inc.) because of its high capacity, versatility (both headspace and stir-bar sorptive extraction modes are possible) and ease of handling in field settings (Figure 1A). Volatile compounds captured by the Twister™ are thermally desorbed for analysis (Figure 1B). Although Twisters™ have been our primary sorbent to date, other sorbent types and volatile sampling methods (e.g., packed cartridge, SPME, direct headspace injections and direct thermal desorption) can be used and are compatible with data annotation and Bin databasing.
Retention index markers
Absolute retention times (RT) of GC-MS peaks shift as a function of column properties (e.g., column type, age, length, phase ratio, film thickness) and RT differences are frequently observed among samples or sample types (Figure 1C). When performing large studies spanning months or years, or comparing many different sample types, RT shifts are unavoidable. Retention indices (RI) overcome this problem by locking the retention times of eluted compounds to fixed positions defined by marker compounds spiked into the sample. Highly different samples can be compiled in a database over years with the use of RI markers.
The vocBinBase algorithm requires the addition of RI marker compounds to all samples for RI corrections. We use fatty acid methyl esters (FAMEs) as RI markers rather than classic straight-chain alkanes (Kovats RI) because FAMEs exhibit electron ionization (EI) fragment patterns (especially at high m/z values) better suited for unambiguous and automated detection. To avoid confusion between the FAME-based RI values and Kovats-based RI values (carbon number × 100), we have adopted a distinctive unit value: FAME RI values range from 262,214 for FAME C4 to 980,934 for FAME C24. For reference, the corresponding alkane-based RI values for FAMEs C4 and C24 are 726 and 2712, respectively. Both FAMEs and alkanes are naturally occurring volatiles [8], so adding the RI mixture prevents detection of the endogenous forms of the specific marker compounds unless isotopically labeled RI markers are used.
The RI mixture for volatile samples includes FAMEs of linear carbon chain lengths C4, C6, C8, C9, C10, C12, C14, C16, C18, C20, C22, and C24. A stock mixture is prepared in methylene chloride with final FAME concentrations of 5 mg/mL (C4), 1.5 mg/mL (C20, C22, C24), 1.2 mg/mL (C6, C8), 0.8 mg/mL (C9, C16, C18) and 0.4 mg/mL (C14-C18). This FAME stock solution is then diluted 200-fold in methyl propionate prior to use. The working FAME RI mixture is introduced externally to the Twister™ in 0.5 μL capillaries. Capillaries are filled with the FAME RI solution and then placed alongside the Twister™ in a frit-bottomed TDU transport tube for thermal desorption (Figure 1B). Chromatograms illustrating the grid-like nature of the FAME RI markers in a citrus leaf volatile sample spiked using the capillary method are shown below (Figure 1D).
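As a rough check on the marker amounts delivered per sample, the stock concentration levels above can be carried through the 200-fold dilution and the 0.5 μL capillary volume. The sketch below assumes that the capillary is filled completely and that its contents are fully transferred; the function and constant names are illustrative only.

```python
# Stock concentration levels quoted in the text, in mg/mL.
STOCK_LEVELS_MG_PER_ML = [5.0, 1.5, 1.2, 0.8, 0.4]
DILUTION_FACTOR = 200        # stock is diluted 200-fold before use
CAPILLARY_VOLUME_UL = 0.5    # capillary volume in microlitres


def mass_per_capillary_ng(stock_mg_per_ml: float) -> float:
    """Nanograms of one FAME in a completely filled capillary (assumption)."""
    stock_ng_per_ul = stock_mg_per_ml * 1000.0  # 1 mg/mL == 1000 ng/uL
    working_ng_per_ul = stock_ng_per_ul / DILUTION_FACTOR
    return working_ng_per_ul * CAPILLARY_VOLUME_UL


for level in STOCK_LEVELS_MG_PER_ML:
    print(f"{level} mg/mL stock -> {mass_per_capillary_ng(level):.2f} ng per capillary")
```

Under these assumptions each marker is delivered at roughly 1 to 12.5 ng per sample.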
Instrumentation
Volatile sample analyses are performed on a 6890 GC (Agilent Technologies, Santa Clara, CA) equipped with a thermal desorption unit (TDU, GERSTEL, Inc., Muehlheim, Germany), cryo-cooled injection system inlet (CIS4, GERSTEL, Inc.) and robotic sampler (MPS2, GERSTEL, Inc.) interfaced to the Pegasus IV time-of-flight mass spectrometer (Leco, St. Joseph, MI).
Thermal desorption and injector parameters
Exposed Twisters are thermally desorbed in the TDU in splitless mode (50 mL/min flow rate, solvent vent mode) at an initial temperature of 30°C, ramped to 250°C at a rate of 12°C/s, and then held at the final temperature for 3 min. The desorbed analytes are cryofocused in the CIS4 inlet with liquid nitrogen (-120°C). After desorption the inlet is heated from -120 to 260°C at a rate of 12°C/s and held at 260°C for 3 min.
GC-TOF-MS settings
GC-TOF-MS instrument settings and programming are defined in standard operating procedures in order to produce data that can be auto-annotated and compiled across studies. Chromatographic separation is performed on an Rtx-5SilMS column with a 10 m integrated guard column [95% dimethyl/5% diphenyl polysiloxane film; 30 m × 0.25 mm (inside diameter) × 0.25 μm d.f. (Restek, Bellefonte, PA)]. The GC oven temperature program is as follows: an initial temperature of 45°C with a 2 min hold; a 20°C/min ramp to 300°C with a 2 min hold; and a 20°C/min ramp to 330°C with a 0.5 min hold. The carrier gas (99.9999% He) flow is held constant at 1 mL/min. The transfer line temperature between the gas chromatograph and mass spectrometer is 280°C. Mass spectra are acquired at 25 spectra/s with a mass range of 35-500 m/z. The detector voltage is set at 1800 V and the ionization energy at 70 eV. The ion source temperature is 250°C.
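The oven program implies a fixed run length, and with it an approximate number of raw spectra per chromatogram. The short calculation below is only a sketch: it assumes that spectra are acquired over the entire oven program (no acquisition delay is stated in the text), and the names are illustrative.

```python
INITIAL_TEMP_C = 45.0
INITIAL_HOLD_MIN = 2.0
# Each ramp: (rate in degC/min, target temperature in degC, hold in min).
RAMPS = [(20.0, 300.0, 2.0), (20.0, 330.0, 0.5)]
ACQUISITION_RATE_HZ = 25  # spectra per second


def oven_run_time_min(initial_temp_c, initial_hold_min, ramps):
    """Total oven program duration in minutes."""
    total = initial_hold_min
    temp = initial_temp_c
    for rate, target, hold in ramps:
        total += (target - temp) / rate + hold  # ramp time plus hold time
        temp = target
    return total


run_time_min = oven_run_time_min(INITIAL_TEMP_C, INITIAL_HOLD_MIN, RAMPS)
# Assumes acquisition spans the whole program.
spectra_per_run = round(run_time_min * 60 * ACQUISITION_RATE_HZ)
print(run_time_min, spectra_per_run)
```

With the stated settings this gives an 18.75 min program and, under the full-run assumption, 28,125 raw spectra per chromatogram.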
BinBase database construction
Database structure
The BinBase code was developed in Java and Groovy, and is based entirely on open-source software. BinBase employs a multilayered software architecture (Figure 2). At the core of BinBase is an SQL-conforming database which stores mass spectra (generated during sample analysis), analysis results and cached data (for improved speed). Database contents are accessed by the cluster, application server and Bellerophon using Java Database Connectivity (JDBC). This access is encapsulated by Enterprise JavaBeans (EJB) and the Hibernate object mapping framework. The BinBase central configuration is stored in the Application Server, which also houses EJB, WSDL (Web Service Description Language)-based services, JMS (Java Messaging Service), and JMX (Java Management Extensions) components; together these comprise the BinBase Communication Interface (BCI). These EJBs provide an interface to the database and allow other Java programs to access the database, query data and start calculations in a defined, restricted manner. The Hibernate persistence and object mapping layer allows for execution of complex queries in a simple, intuitive way and is primarily used by Bellerophon, the BinBase administration graphical user interface (GUI) (see below). A WSDL service layer was added to overcome EJB limitations so that BinBase can be accessed from most programming languages. Internally, the WSDL service layer is also used for all web front-ends and communications with SetupX/MiniX. JMX components are used to configure the whole system at a central location and monitor system properties. The BCI module plays a key role in system security by limiting user access to particular services based on IP address and password, and by preventing denial of service (DoS) attacks or SQL injection attacks.
BinBase database installation requirements
The BinBase system requires a Rocks Linux cluster-based architecture to calculate mass spectral data. This is minimally established with a system consisting of two standard personal computers (PCs). The first PC stores data (*.netcdf files, *.txt files and database content), provides access to web pages and maintains the calculation queue. The second PC performs calculations. A dual core 2 GHz central processing unit (CPU) and 4 GB RAM are sufficient for each of these PCs if the calculation load does not exceed several hundred samples a day. Because of its data storage function, the first PC requires 1-2 TB storage and two 1 GB network cards. A smaller hard drive (200 GB) and a single network card are sufficient for the second PC. Our current configuration at the Genome Center [...] each and one head node with a solid state disk-based storage array for improved database access.
The BinBase database is available to the public under the LGPL 2.0 license (http://binbase.sourceforge.net), and is accessible using different web front-ends and rich client applications as well as a webservice layer. Documentation required for installation and administration of the system is also found at this website.
Bellerophon
The front-end graphical user interface (GUI) Bellerophon is the central administration tool for BinBase and is used for Bin management, database browsing and retention index configuration. Bellerophon is an Eclipse 3 SWT-based rich client platform (RCP) application. It includes visualization capabilities based on JFreeChart and supports database queries via a Hibernate framework. The Hibernate framework supports mapping database tables to objects. Dynamic SWT-tables and visualizations are created from these objects via Java Reflection-API and XDoclet.
SetupX
SetupX is a study design database whose primary functions include capturing experimental metadata for class generation, randomizing and scheduling GC-TOF-MS sequences, and storing annotated GC-TOF-MS data along with all other data files connected to an experiment (e.g., photographs, assay spreadsheets, other instrumental data files). Details regarding SetupX structure have been described [35, 37]. We have developed a leaner version of this database, MiniX. User requests for BinBase annotations through the MiniX website activate the MiniX BinBase export function by EJB and JMS. BinBase additionally requests experimental class information from MiniX through EJBs. MiniX is an open source project and can be downloaded and installed under the LGPL 2.0 license (http://code.google.com/p/minix/).
vocBinBase filtering algorithm
The vocBinBase algorithm takes the deconvoluted spectra and metadata provided by the Leco ChromaTOF software as well as sample information from the study design database SetupX/MiniX and applies a multi-tiered filtering system that either annotates spectra to existing database entries ('Bins'), creates and adds new Bins to the database if all quality criteria are met, or discards low-quality spectra to maintain database integrity (see Additional File 1, figure S1). Each database entry or "Bin" represents a unique compound that has matched all mass spectral, instrumental and class metadata thresholds. Bins are minimally defined by the following properties: mass spectrum, retention index (RI), quantification mass, list of unique masses, and a unique identifier number.
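The minimal Bin definition above maps naturally onto a small record type. The following is only an illustrative sketch and not the BinBase schema: the class and field names, the string encoding of the spectrum, and all example values are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Bin:
    """Hypothetical record holding the minimal Bin properties named in the text."""
    bin_id: int                     # unique identifier number
    retention_index: float          # FAME-based retention index
    quant_mass: int                 # quantification mass (m/z)
    unique_masses: Tuple[int, ...]  # list of unique masses (m/z)
    spectrum: Dict[int, float]      # mass spectrum: m/z -> intensity

    def spectrum_string(self) -> str:
        """Encode the spectrum as 'm/z:intensity' pairs (assumed format)."""
        pairs = sorted(self.spectrum.items())
        return " ".join(f"{mz}:{intensity:g}" for mz, intensity in pairs)


# Placeholder values, not taken from the database.
example = Bin(
    bin_id=1,
    retention_index=446700.0,
    quant_mass=93,
    unique_masses=(93,),
    spectrum={71: 100.0, 93: 85.0, 41: 60.0},
)
print(example.spectrum_string())
```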
Data preprocessing
Raw data are pre-processed by the Leco ChromaTOF software and stored as ChromaTOF-specific *.peg files, generic *.txt results, and as generic ANDI MS *.cdf files. ChromaTOF (v. 2.32) data processing parameters specified in pre-processing steps include baseline setting just above noise (value = 1), no smoothing, and signal-to-noise ratio minimum of 20. The *.txt files are exported to a file server for further processing by the algorithm. The vocBinBase algorithm is compatible with ChromaTOF software versions 2.32 to the current version, 4.33.
Spectral validation
After importing all deconvoluted spectra of all chromatograms of a biological study (*.csv format), spectra are checked for the presence and abundance of the unique ion (relative to the base peak), the presence of all apex masses (masses that share the maximum intensity with the peak maximum of the unique ion), and for the number of peaks that exceed apex intensity thresholds. Spectral validation is the first data quality filter; chromatograms with overloaded peaks and deconvolution errors are used only for peak matching, but not for Bin generation.
Retention index calculations based on fatty acid methyl esters
The BinBase algorithm for retention index correction first applies a base peak filter to all spectra to locate the FAME RI markers (no retention time information is used). From this filtered list, the FAME peak with the highest mass spectral similarity score is used as the reference point from which distance measures are applied to higher and lower retention times to locate all other RI markers. Once all the required FAME markers are found, a correction curve is calculated using a linear regression for the first two and last two standards and a polynomial regression of the fifth-order for the standards in between. The polynomial regression is applied within the calibrated range to account for the absolute and relative retention time shifts, which differ from linear regressions at early and at late retention times. As high-degree polynomials perform poorly at extrapolating, linear regression is used to extrapolate outside the RI marker range. In the event that not all early- and late-eluting RI markers are found, the generation of new Bins is disabled, but matching existing Bins is still viable.
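The piecewise correction curve can be sketched with NumPy as follows. This is a simplified illustration, not the BinBase implementation: it assumes that the fifth-order polynomial is fitted to all markers and used only inside the marker range, and that the lines through the first two and the last two markers are used outside it. The marker retention times and the evenly spaced RI values in the example are synthetic placeholders.

```python
import numpy as np
from numpy.polynomial import Polynomial


def build_ri_curve(marker_rt, marker_ri, degree=5):
    """Return a function mapping retention time to retention index."""
    rt = np.asarray(marker_rt, dtype=float)
    ri = np.asarray(marker_ri, dtype=float)
    if rt.size <= degree:
        raise ValueError("need more markers than the polynomial degree")
    inner = Polynomial.fit(rt, ri, degree)      # used inside the marker range
    early = Polynomial.fit(rt[:2], ri[:2], 1)   # line through first two markers
    late = Polynomial.fit(rt[-2:], ri[-2:], 1)  # line through last two markers

    def to_ri(t):
        t = np.asarray(t, dtype=float)
        # Linear extrapolation outside the range, polynomial inside it.
        return np.where(t < rt[0], early(t),
                        np.where(t > rt[-1], late(t), inner(t)))

    return to_ri


# Synthetic example: 12 markers, evenly spaced in time and RI (placeholders).
marker_rt = np.linspace(120.0, 1000.0, 12)        # seconds
marker_ri = np.linspace(262214.0, 980934.0, 12)   # C4 and C24 endpoints from the text
to_ri = build_ri_curve(marker_rt, marker_ri)
print(float(to_ri(500.0)), float(to_ri(60.0)))
```

`Polynomial.fit` rescales the time axis internally, which keeps a fifth-order fit on retention times of several hundred seconds numerically well behaved.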
Parameters used to find the RI markers for volatile samples required substantial modification from those used in the metabolite algorithms. Match settings and base peak patterns had to be redefined to accommodate the extension of the FAMEs to include C4 and C6, as well as the change in the m/z range from 85-500 to 35-500. This extension of the m/z range to lower values is absolutely required for the volatile compounds, as they are not TMS-derivatized and the 35-85 m/z range provides important fragment data to aid in compound identification. To avoid losing high quality data in which FAMEs were not in specification, existing algorithms were modified to allow for the application of a correction curve of a previous or later sample acquired on the same day to the sample in question. If no such valid RI data were found, search windows were extended up to ten days; otherwise, a partial curve is generated using the RI markers found in the solitary sample. In all of these cases, Bin generation is disabled, but all existing Bins are assigned.
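The fallback order for samples whose FAMEs are out of specification can be written as a small decision function. This is a sketch under stated assumptions: the function name and return convention are hypothetical, and choosing the donor sample closest in acquisition time is an assumption, since the text does not say how a donor is picked among several valid candidates.

```python
from datetime import datetime, timedelta


def choose_ri_source(sample_time, own_curve_valid, valid_curve_times, max_days=10):
    """Return (source, donor_time, bin_generation_allowed)."""
    if own_curve_valid:
        return ("own", None, True)
    # First choice: a valid curve from a sample acquired on the same day.
    pool = [t for t in valid_curve_times if t.date() == sample_time.date()]
    if not pool:
        # Second choice: widen the search window to up to ten days.
        window = timedelta(days=max_days)
        pool = [t for t in valid_curve_times if abs(t - sample_time) <= window]
    if pool:
        donor = min(pool, key=lambda t: abs(t - sample_time))  # assumed rule
        return ("borrowed", donor, False)
    # Last resort: partial curve from the markers found in this sample.
    return ("partial", None, False)


# Arbitrary example dates.
sample = datetime(2010, 5, 3, 14, 0)
same_day = datetime(2010, 5, 3, 9, 0)
earlier = datetime(2010, 4, 28, 9, 0)
print(choose_ri_source(sample, False, [earlier, same_day]))
```

In every branch except the first, Bin generation is disabled, matching the behavior described above.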
Peak annotation by the BinBase algorithm
The ChromaTOF metadata used in peak annotation by the BinBase algorithm include mass spectral similarity, peak purity (an estimate of the number, proximity and similarity of co-eluting peaks), retention index, signal-to-noise ratio, unique ion, apex ions and unique mass-to-base peak ratio. Additional metadata reported by the ChromaTOF software (e.g., peak height, area %) are not used by the algorithm. Following RI correction (described above), spectra are sequentially annotated by decreasing peak intensity. For a given peak, the algorithm sets an RI window (±2,000 FAME RI units, ~2 s) and uses a unique ion match filter to match either the unique ion or apexing ions of the deconvoluted peak to generate a list of possible Bin assignments. With just these two parameters, a high degree of filtering is achieved. For example, for a compound with a FAME RI value of 446,700 and the unique ion m/z 93, the RI filter reduces the number of mass spectral comparisons from 1,537 entries to eight potential hits. The unique ion constraint further reduces possible Bin matches from eight hits to two candidates [terpinolene (monocyclic terpene) or linalool (linear terpene alcohol)] (Figure 3). Only at this stage is a mass spectral similarity filter applied, which uses variable thresholds based on peak signal-to-noise ratio and peak purity. An abundant, well-resolved peak requires a higher mass spectral similarity score for successful annotation than a small or co-eluting peak.
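The filter cascade (RI window, then unique ion, then similarity) can be sketched as below. This is not the BinBase code. The threshold values in `similarity_threshold`, the Bin RI values, the signal-to-noise ratio and the third and fourth Bin entries are hypothetical placeholders; only the ±2,000 RI window, the peak RI of 446,700, the unique ion m/z 93, the purity of 0.1137 and the linalool similarity of 917 come from the text (terpinolene is given only as <500, so 450 is a stand-in).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

RI_WINDOW = 2000  # FAME RI units, from the text


@dataclass
class BinEntry:
    name: str
    ri: float
    unique_ion: int


@dataclass
class Peak:
    ri: float
    unique_ion: int
    signal_to_noise: float
    purity: float
    apex_ions: Set[int] = field(default_factory=set)


def similarity_threshold(peak: Peak) -> int:
    """Hypothetical rule: pure, abundant peaks must match more strictly."""
    if peak.purity < 0.5 and peak.signal_to_noise > 100:
        return 800
    return 600


def annotate(peak: Peak, bins: List[BinEntry], similarity: Dict[str, int]) -> Optional[str]:
    in_window = [b for b in bins if abs(b.ri - peak.ri) <= RI_WINDOW]      # RI filter
    ions = {peak.unique_ion} | peak.apex_ions
    candidates = [b for b in in_window if b.unique_ion in ions]           # unique ion filter
    threshold = similarity_threshold(peak)
    passing = [b for b in candidates if similarity[b.name] >= threshold]  # MS similarity
    if not passing:
        return None
    # Simplification: the closest RI wins; the similarity override is omitted.
    return min(passing, key=lambda b: abs(b.ri - peak.ri)).name


bins = [
    BinEntry("linalool", 446900.0, 93),      # placeholder RI
    BinEntry("terpinolene", 445500.0, 93),   # placeholder RI
    BinEntry("unknown A", 447000.0, 57),     # hypothetical entry
    BinEntry("unknown B", 500000.0, 93),     # hypothetical entry outside the window
]
similarity = {"linalool": 917, "terpinolene": 450, "unknown A": 300, "unknown B": 300}
peak = Peak(ri=446700.0, unique_ion=93, signal_to_noise=200.0, purity=0.1137)
print(annotate(peak, bins, similarity))
```

Applying the cheap RI and unique ion filters first means that the more expensive spectral comparison is needed for only a handful of candidates.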
In effect, different thresholds for each parameter can be defined for different peaks. In the example illustrated above (Figure 3), the peak is reasonably pure (peak purity = 0.1137) and a high mass spectral similarity score is required for Bin matching. Based on these final filtering criteria and the mass spectral similarity scores for linalool (917) and terpinolene (<500), the final compound assignment in this example is linalool. In this particular example, there are, in fact, three Bins within the ±2,000 FAME RI unit window, two of which have a unique ion value of m/z 93. The second Bin with the unique ion m/z 93 is terpinolene.
At this stage in the annotation, more than one Bin assignment may remain (e.g., stereoisomers that might elute within the search RI window). The isomer with the closest matching RI is then annotated, unless an alternate Bin has a significantly greater similarity score. Spectra that are filtered out in the isomer filter might still be able to match other neighboring Bins and are therefore fed back into the annotation algorithm.
New Bin generation - tracking unknown compounds
In the event the spectrum does not match an existing Bin, the BinBase algorithm generates a new Bin if specific, highly stringent criteria are met. First, the spectrum in question must pass strict mass spectral quality thresholds based on purity (purity value < 1.0) and intensity (S/N > 25). Thresholds for the Bin-generating mass spectral filter are more stringent than those for the similarity filter to ensure that only abundant and pure spectra become new Bins. Second, a potential new Bin must pass an experimental class filter before being validated. This filter demands that a new Bin is detected in at least 80% of all samples of an experimental class in order to ensure its identity as a genuine volatile and not a spurious contaminant. All database Bins were generated by the algorithm as described from data collected in laboratory and field experiments.
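The Bin-generation criteria amount to three checks, which can be expressed compactly. The function and parameter names below are illustrative; the numeric thresholds are the ones given in the text.

```python
def may_create_bin(purity: float, signal_to_noise: float,
                   samples_detected: int, class_size: int) -> bool:
    """True if an unmatched spectrum qualifies as a new Bin."""
    if class_size <= 0:
        raise ValueError("class_size must be positive")
    passes_quality = purity < 1.0 and signal_to_noise > 25
    passes_class_filter = samples_detected / class_size >= 0.8
    return passes_quality and passes_class_filter


print(may_create_bin(0.3, 40.0, 5, 6))   # pure, abundant, found in 5 of 6 samples
print(may_create_bin(0.3, 40.0, 4, 6))   # found in only 4 of 6 samples
```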
Post-matching and replacements
Once all spectra of all experimental classes have been annotated, a comprehensive Bin list including all Bins found across the experiment is compiled. Then all spectra are again matched against the Bin list (post-matching) in order that all Bins, including any newly-generated Bins, are searched in all samples. In this step, spectra in samples which did not pass the more stringent MS thresholds required for Bin generation may pass the thresholds required for Bin annotation.
In some cases a Bin is not positively detected in all chromatograms, either because it is absent or of low abundance (true negative), or because it is present but the quality criteria are not sufficient to allow assignment (false negative). This would result in a zero value in the data matrix, which hampers subsequent statistical analyses. A strategy has been devised and programmed into the algorithm to calculate a replacement value in these cases. First, the algorithm determines the average retention time for each metabolite over the analytical sequence by calculating the average retention index for the samples and transforming it back to the retention time using the retention index correction curve. Next, the raw, unprocessed chromatograms (netCDF or ANDI MS file formats) are opened, and the maximum intensity of the selected quantification ion trace within ±2 s of the target retention time is reported for each missing volatile compound, minus the local background noise for that ion within ±5 s of the target retention time. The background-subtracted ion intensity is reported in the result table with color coding to indicate the result as a 'second-pass' assignment. Validation of the replacement algorithm was performed by comparing manual annotations of replaced values in sample sets with their algorithm replacement values.
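A minimal sketch of the replacement calculation on a single quantification ion trace follows. The text does not define how the local background noise is estimated; taking the minimum intensity in the ±5 s window is an assumption made here, and the synthetic trace is a placeholder.

```python
import numpy as np


def replacement_value(times_s, intensities, target_rt_s,
                      peak_window_s=2.0, noise_window_s=5.0):
    """Background-subtracted maximum of an ion trace near the target RT."""
    t = np.asarray(times_s, dtype=float)
    y = np.asarray(intensities, dtype=float)
    near_peak = np.abs(t - target_rt_s) <= peak_window_s
    near_noise = np.abs(t - target_rt_s) <= noise_window_s
    if not near_peak.any():
        raise ValueError("no data points within the peak window")
    peak_max = y[near_peak].max()
    background = y[near_noise].min()  # assumed noise estimate
    return float(peak_max - background)


# Synthetic trace sampled at 25 spectra/s: flat baseline with one raised point.
times = np.arange(0.0, 20.0, 0.04)
trace = np.full(times.shape, 100.0)
trace[250] = 600.0  # close to t = 10 s
print(replacement_value(times, trace, target_rt_s=10.0))
```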
vocBinBase Report
All Bins detected in at least 80% of an experimental class are included in the result report folder. Additionally, the report folder contains a result file for all Bins detected in at least 50% of an experimental class. The 50% result can be used by researchers to complement the 80% dataset with more identified metabolites or to evaluate the less confidently found or rare peaks. Each entry in the exported Bin table is reported as the intensity of the Bin quantifier mass, which is by default the unique ion, though this value can be changed manually to any ion in the spectrum by the database administrator. We use peak heights and not peak areas for several reasons. Peak heights are preferable to peak areas for small peaks, because baseline settings impact peak areas more for small peaks than for larger peaks. Additionally, peak heights based on defined unique ions provide a more stable measure than other parameters such as dTIC or TIC, because for analyzing a given compound in different chromatograms, the number and hence, the combined intensity of detected ions will differ, depending on the peak abundance and purity.
All Bins exported by the vocBinBase database are reported with a unique database identifier, the quantification ion, the retention index value, and the complete mass spectrum encoded as a string (Figure 4). Database entries are named using the Adams plant volatile library (described below). Compounds that are not plant-derived including pesticides, plasticizers and other contaminants are annotated using the NIST-RI library. Known artifacts related to column bleed are annotated in vocBinBase, but are not exported to users in result reports (m/z 207, 221, 281, 355). Database administrators can manually exclude (or include) peaks in the list of reported Bins. For example, Twister™-based artifacts are manually selected for exclusion in results tables. Result data sheets are produced as XLS and TXT formats (or XML if needed). Once identified, Bins are also reported with their chemical name and PubChem identifier.
Bin Identification
Bin identification is supported by the Adams library of mass spectra and retention index data for over 2,000 purified plant volatiles and essential oil components [30], verified for many compounds using authentic standards in our laboratory. Prior to uploading the Adams library into Bellerophon for Bin matching, the library was converted from HP Chemstation format to NIST library format using the Lib2NIST download available at the NIST website (http://chemdata.nist.gov). Additionally, the alkane-based Adams RI values were converted to their BinBase FAME RI equivalents. The RI conversion between the Adams and Fiehn chromatographic variants (different GC oven temperature programming and column manufacturer) was accomplished with a 2nd-order polynomial; the conversion details are given at http://fiehnlab.ucdavis.edu/projects/VocBinBase/. All identified volatiles in vocBinBase are annotated with PubChem chemical identifiers and structure-encoding InChI hash keys to enable cross-references to chemistry databases and structural information tools.
The quality of the RI conversion was tested by injecting authentic reference standards present in the Adams library under standard operating parameters. A comparison of the calculated values with experimentally determined values for 70 reference compounds yielded a correlation of 0.9995 with a standard error of 3,380 RI units (standard deviation of residual error, RIcalculated-RIexperimental). A comparison of calculated and experimental values for 130 Adams library annotations yielded similar values (r² = 0.9994, SE = 3,320 RI units). A plot of the absolute RI deviation (RIcalculated-RIexperimental) for the 70 standards and 130 library annotations revealed that 61% of the injected compounds were within one standard error, and 58% of the annotated compounds fell within one standard error of the calculated value. See Additional File 2, figure S2 for the graphed data.
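To put these standard errors on the more familiar alkane scale, the two reference points given earlier (FAME C4: 262,214 vs. 726; FAME C24: 980,934 vs. 2712) can be used to estimate an average scale factor. The calculation below is a rough linear approximation over the whole C4 to C24 range and is not the 2nd-order conversion described above.

```python
# Reference points from the text: (alkane-based RI, FAME-based RI).
C4 = (726.0, 262214.0)
C24 = (2712.0, 980934.0)

# Average number of FAME RI units per alkane RI unit (linear approximation).
fame_per_alkane_unit = (C24[1] - C4[1]) / (C24[0] - C4[0])

se_standards_alkane = 3380.0 / fame_per_alkane_unit   # 70 reference compounds
se_library_alkane = 3320.0 / fame_per_alkane_unit     # 130 library annotations

print(round(fame_per_alkane_unit, 1),
      round(se_standards_alkane, 1),
      round(se_library_alkane, 1))
```

On this approximation one alkane RI unit corresponds to about 362 FAME RI units, so both standard errors are on the order of 9 alkane RI units.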
Database contents
At present the database contains spectra from 3,435 samples representing 18 species. Despite the 1.7 million imported, fully deconvoluted spectra, the vocBinBase database currently contains only 1,537 unique Bins. Of all imported spectra, 45% fail to meet algorithm thresholds and are discarded; such spectra are noisy and inconsistent. The lower the thresholds users set for peak detection in ChromaTOF (e.g., lowering peak finding criteria from s/n > 20 to s/n > 3), the more peaks are detected. Most of the corresponding peak spectra would, however, be discarded by the BinBase algorithm as too noisy and would not be reported in output sheets. A similar rate of discarding spectra was reported by the SpectConnect tool [25] that employs AMDIS deconvolution data [24] of GC-quadrupole MS instruments. Under the settings used here, the remaining 55% of the spectra meet the quality criteria and are annotated and stored in the database (Figure 5). Approximately 12% of the annotated compounds are column- and Twister™-derived polysiloxane artifacts; these artifacts are annotated by the algorithm but are not included in the BinBase reports exported for users. As described above, annotations rely on multiple criteria and certain thresholds are variable depending on various metadata values; the required MS similarity threshold depends on peak abundance and purity (e.g., a low purity peak requires a less stringent MS similarity match). A small percentage of annotated spectra (4%) are generated by very pure peaks (purity < 0.15) with high MS similarity score, while the majority of database entries are generated by pure peaks (purity < 1.5, 46%) or not pure peaks (purity > 1.5, 39%).
Of the current 1,537 Bins, 211 have been identified as genuine volatiles through mass spectral-retention index matching. In addition, 161 Bins were annotated as polysiloxane artifacts (which therefore do not get exported into study result data sheets), and the remaining Bins are as yet unidentified. Visualization of the VOC database contents using spectral similarity (all Bins) and the Tanimoto chemical similarity coefficient (identified Bins) was performed using Cytoscape (Figure 6). The Tanimoto similarity coefficient is a similarity metric that calculates a score indicating the level of similarity between molecules being compared [38]. The network overview provides a visual representation of the relationships between the 1,537 Bins. The identified compounds are represented by red nodes and the unidentified compounds by grey nodes. Nodes clustered closely together are more similar than those nodes with just a single connection at the edge of the network. Blue edges link identified volatiles with structural similarity greater than 700. Note that the polysiloxane artifacts cluster away from the compounds, due to their very distinctive fragmentation patterns. Network regions with identified compounds (red nodes) have been labeled with class information.
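For readers unfamiliar with the Tanimoto coefficient, a minimal sketch on binary structural fingerprints is given below. Representing each molecule as a set of "on" bit positions and scaling the score to 0-1000 (suggested by the edge cutoff of 700) are assumptions of this illustration; the fingerprints shown are arbitrary and do not correspond to real compounds.

```python
def tanimoto(bits_a, bits_b) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def scaled_similarity(bits_a, bits_b) -> int:
    """Tanimoto score on an assumed 0-1000 scale."""
    return round(1000 * tanimoto(bits_a, bits_b))


# Arbitrary example fingerprints (sets of on-bit positions).
fp_a = {1, 2, 3, 5, 8, 13, 21, 34}
fp_b = {1, 2, 3, 5, 8, 13, 21, 55}
print(tanimoto(fp_a, fp_b), scaled_similarity(fp_a, fp_b) > 700)
```

Here the two fingerprints share seven of nine distinct bits, so the scaled score is 778 and the pair would be linked under a cutoff of 700.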