Visualizing high and low quality features
When working with untargeted LC-MS data, visualization of extracted ion chromatograms (EIC) of features can be used to optimize peak detection, peak quantification, and biomarker discovery [8, 15, 16]. We propose randomly sampling several hundred EICs after peak detection and quantification to visualize peak morphology and integration. The EICs can then be classified by the user as “high” or “low” quality (see Fig. 2). A high quality peak has good morphology (e.g., is bell-shaped, although this is not a necessary condition), the correct region of integration across all samples, and proper retention time alignment. Such visualization is made easy with plotting functions from peak detection software such as the ’highlightChromPeaks’ function within XCMS [8]. In almost all cases, we find the distinction between high and low quality peaks to be clear, but when peaks are ambiguous we make the conservative choice to classify them as low quality. Once features are classified as high or low quality, their characteristics across samples such as average blank and biological sample abundance, percent missing, and ICC can be compared and used to perform feature filtering. While classification of high and low quality peaks is a time intensive step, we have found that visualization and inspection of hundreds of features takes between 1–2 h and greatly improves the ability to uncover biological variability in the data. Moreover, after feature visualization, executing the remaining steps of the filtering pipeline requires no more than 1 h.
Data-adaptive feature filtering
Example datasets
To help present and visualize our data-adaptive feature filtering methods, we introduce an untargeted LC-HRMS dataset generated in our laboratory on a platform consisting of an Agilent 1100 series LC coupled to an Agilent 6550 QToF mass spectrometer. The dataset contains the metabolomes of 36 serum samples from incident colorectal cancer (CRC) case-control pairs as described in [16, 17]. Over 21,000 features were detected in the 36 serum samples that were analyzed in one batch [16, 17]. We randomly sampled over 900 features from the dataset and classified these as “high” or “low” quality according to their peak morphology and integration quality. To demonstrate the performance of our data-adaptive pipeline, we split the known high and low quality features into a training set (60%) and a test set (40%). Features in the training set were used to visualize appropriate, data-dependent cutoffs, whereas features in the test set were used to evaluate the effectiveness of the selected cutoffs. At each stage of the data-adaptive filtering, we compared our method to more traditional filtering methods by examining the proportions of high and low quality features in the test set that were removed. An application of the data-adaptive pipeline to another untargeted LC-HRMS metabolomics dataset generated in our laboratory can be found in Additional file 1. This additional dataset represents the metabolomes of 4.7-mm punches from archived neonatal blood spots (NBS) of 309 incident case subjects that were obtained for the California Childhood Leukemia Study [15, 18]. For the sake of clarity, we do not include results for this second dataset in the main text, and the results can be found instead in Additional file 1.
We also visualized and classified over 200 features in each of two public LC-MS datasets. One of the public datasets was generated on a platform consisting of an Accela liquid chromatographic system (Thermo Fisher Scientific, Villebon-sur-Yvette, France) coupled to an LTQ-Orbitrap Discovery (Thermo Fisher Scientific, Villebon-sur-Yvette, France). This dataset contains the metabolomes of 189 human urine samples analyzed in negative mode. We took a subset of 45 of the urine samples in the first batch, along with 14 pooled QC samples and 5 blank samples. We processed this dataset using the original xcms functions and parameters used by the authors (W4M00002_Sacurine-comprehensive) [10, 19]. The second public dataset was generated on a platform consisting of an Accela II HPLC system (Thermo Fisher Scientific, Bremen, Germany) coupled to an Exactive Orbitrap mass spectrometer (Thermo Fisher Scientific) [20]. This dataset contains the metabolomes of epithelial cell lines treated with low and high concentrations of chloroacetaldehyde. We used all 27 cell line samples in negative mode treated with low concentrations, as well as 6 pooled QC and 11 blank samples. The original work did not use xcms to process the raw data, so we used the R package IPO to determine the xcms parameters [21].
Filtering features based on blank samples
Blank control samples, which are obtained from the solvents and media used to prepare biological samples, can help to pinpoint background features that contribute to technical variation [2, 3, 10, 22, 23]. A common filtering method is to use a fold-change (biological signal/blank signal) cutoff to remove features that are not sufficiently abundant in biological samples [3, 10, 12]. Rarely does the user examine the data to determine a suitable cutoff. We employ a data-adaptive procedure that takes into account the mean abundance of features in blank and biological samples, the difference between mean abundances in blank and biological samples, and the number of blank samples in which each feature is detected. Our method then assigns cutoffs according to the background noise and average level of abundance. If the dataset contains several batches, filtering is performed batch-wise.
We use a mean-difference plot (MD-plot) to visualize the relationship between feature abundances in the blank and biological samples and assess background noise (Fig. 3). Abundances are log transformed prior to all data pre-processing and visualization. The mean log abundances of each feature across biological and blank samples are then calculated and the average of and difference between these two means are then plotted on the x- and y-axes, respectively. The horizontal zero-difference line (blue lines in Fig. 3) represents the cutoff between features having higher mean abundances in the blank samples and those having higher mean abundances in the biological samples. If there are n blank samples in a batch, then n+1 clusters of features will typically be visually identifiable in the MD-plot, where cluster i=0,…,n is composed of features that are detected in i blank samples. For example, because three blank samples per batch were used in the example dataset, four clusters are identifiable in Fig. 3a. Similar clusters can be identified in all datasets generated from our laboratory (See Additional file 1: Figure S1) and in the public datasets. Filtering is then performed separately for each cluster. If a cluster contains no high quality features, as is often the case with clusters that contain lower abundance features, that cluster can be removed entirely.
The cluster corresponding to features detected in all n blank samples tends to have the highest number of features (around 95% of the total number of features), features with higher average abundances, and the highest number of high quality features. Therefore, careful, data-dependent filtering of this cluster is crucial for the success of subsequent analyses. This cluster also has a non-uniform distribution of mean feature abundances (Fig. 3b). This cluster is thus partitioned based on quantiles (20th, 40th, 60th, and 80th percentiles) of the empirical distribution of mean abundances (x-axis). This ensures that each partition has the same number of features and that the features are uniformly distributed throughout the dynamic range. Within each partition, the empirical distribution of abundances below the zero-difference line is used to estimate the technical variation above that line. The absolute value (green lines in Fig. 3b) of an appropriately identified percentile of the negative mean differences (purple lines in Fig. 3b) is used as a cutoff to remove uninformative features. Users may identify appropriate percentiles of the negative mean differences (purple lines) based on how many high quality features would be removed if the absolute values of those percentiles (green lines) were used as cutoffs. We find percentiles between the lower quartile and median to be appropriate for this cluster of features, because they remove as many low quality features as possible without removing high quality ones. Feature filtering in the remaining clusters can be performed in a similar manner, but without the need to partition features based on average abundance.
Using MD-plots to filter features allows for the simultaneous filtering of features by both the difference in abundance in blank and biological samples (y-axis) and average abundance (x-axis). Average abundance of features across biological samples is a commonly used filtering characteristic, but the filtering is often done using pre-specified cutoffs (e.g., lowest forty percent for datasets with more than one thousand features) (Fig. 4a) [10, 12]. Although we advocate for the filtering approach described previously, if users prefer to filter by just average abundance, the MD-plot allows for easy visualization of a data-dependent cutoff that removes as many low quality features as possible without removing high quality ones. The same can be said for identifying a data-adaptive fold-change (biological signal/blank signal) cutoff, rather than using default cutoffs provided in preprocessing workflows (Fig. 4b) [10]. While we recognize that the background signal can modify the biological signal (e.g., via ion suppression), we do not consider this source of variability.
Filtering features by percent missing
As mentioned above, low-abundance metabolomic features tend to have a high proportion of undetected values across samples. In addition, when using software such as XCMS for peak detection and quantification, peaks can be missed by the first round of peak detection and integration. Functions such as ’fillChromPeaks’ in XCMS are often used to integrate signals for samples for which no chromatographic peak was initially detected [8, 12]. Low quality peaks also tend to have higher proportions of missing values after initial peak identification and integration (Fig. 5 and Additional file 1: Figure S2).
To determine the appropriate filtering cutoff for percent missing, we create side-by-side box plots of percent missing values for the high and low quality features classified by visualization of EICs (Fig. 5a). The box plots help to compare the percentiles of the distributions of percent missing values for the high and low quality features, and to select an appropriate cutoff based on these percentiles. Density plots of percent missing values can also be used to visualize the modes and percentiles of the distributions for high and low quality features (Fig. 5b), and cutoffs can be determined based on these distributional properties. For example, appropriate cutoffs would be those that discriminate between the modes of the two distributions, that remove long tails of distributions of low quality features, that correspond to extreme percentiles of one distribution but intermediate percentiles of another, etc. To ensure that we do not remove features that are differentially missing between biological groups of interest (e.g., mostly missing in cases but not controls), we perform a Fisher exact test for each feature, comparing the number of missing and non-missing values against the biological groups of interest. A small p-value for a given feature would indicate that there is a significant dependence between the phenotype of interest and missing values. Features with a percent missing below the identified threshold or with a Fisher exact p-value less than some threshold (we recommend a small value such as the one hundredth percentile of the p-value distribution) are retained. This test of association between the phenotype of interest and missing values can easily be extended to studies where the biological factor of interest is a multilevel categorical variable or a continuous variable by using, for example, a Chi-Square test or a Wilcoxon rank-sum test, respectively.
Filtering features by ICC
High quality and informative features have relatively high variability across subjects (biological samples) and low variability across replicate samples [10, 12] (Fig. 6 and Additional file 1: Figure S3). Typically, the coefficient of variation (CV) is calculated across pooled QC samples for each feature and those with a CV above a predetermined cutoff (e.g., 20–30%) are removed [1, 2, 10, 12, 22]. However, we find that the CV is often a poor predictor of feature quality (Fig. 7 and Additional file 1: Figure S4) because it only assesses variability across technical replicates, without considering biologically meaningful variability across subjects. Instead, we propose examining the proportion of between-subject variation to total variation, otherwise known as the intra-class correlation coefficient (ICC) [24], as a characteristic for filtering. Since the ICC simultaneously considers both technical and biological variability, a large ICC for a given feature indicates that much of the total variation is due to biological variability regardless of the magnitude of the CV.
Our method for estimation of the ICC employs the following random effects model:
$$\begin{array}{@{}rcl@{}} Y_{i,j} = \mu_{j} + b_{i,j} + \epsilon_{i,j,k}, \end{array} $$
(1)
where Yi,j is the abundance of feature j in subject i, μj is the overall mean abundance of feature j, bi,j is a random effect for feature j in subject i, and εi,j,k is a random error for replicate measurement k for feature j in subject i. The ICC is estimated by taking the ratio of the estimated variance of bi,j (between-subject variance) to the estimated variance of bi,j+εi,j,k (total variance). If replicate specimens or LC-MS injections are analyzed for each subject, then application of Eq. 1 is straightforward. However, since metabolomics data are often collected with single measurements of each biospecimen and employ repeated measurements of pooled QC samples to estimate precision, then Eq. 1 can be fit by treating the pooled QC samples as repeated measures from a ’pseudo-subject’. As with percent missing, density plots and box plots of the estimated ICC values for high and low quality features can be compared to determine a data-specific filtering cutoff (Fig. 6). Again, we look to the modes and percentiles of the distributions of the high and low quality features to select an appropriate cutoff that strikes a balance between removing low quality peaks and retaining high quality ones. If multiple batches are involved, the final feature list represents the intersection of features from all batches.