Alignment model
We present a statistical model for the alignment of label-free proteomics data to match peptide features across multiple samples after peak-detection and de-isotoping. Unlike any existing proteomics alignment method, our model has the ability to utilize ion mobility drift time from HDMSE experiments, and product ion spectra from traditional LC-MS/MS Data Dependent Aquisition (DDA), or DIA (MSE or HDMSE) experiments, along with the typical parent ion mass-to-charge ratio and retention time - increasing the individuality of each peptide feature and providing a better alignment. At the time of publication, no open-source proteomics file format was capable of storing ion mobility separation data. In order to allow incorporation of this data into our alignment method, we wrote a small data-processing script to read Waters ‘spectrum.xml’ and ‘finalfrag.csv’ files into a Matlab data frame. The Matlab data frame format, a sample configuation file, a sample queue submission file, and the data processing script are available in Additional files 1, 3, 4 and 5, respectively, and can be easily adapted to incorporate any additional separation dimensions similar to ion mobility drift times and liquid chromatography retention times, including retention times from multidimensional LC.
Model
We adapt a DPGMM [31, 32] by adding sample-specific shift and scale parameters. Gaussian Mixture Models lend themselves well to the problem of proteomics data alignment. Each peptide existing in nature has a theoretical mass-to-charge ratio, retention time, etc. within a specific experimental condition, and is represented in our model as a mixture component. We expect the measurement of a peptide to have the same mass-to-charge ratio, retention time, etc., with two different types of measurement error: systematic error and random error. As with any laboratory experiment, LC-MS/MS data are subject to variability. The LC retention times often shift between runs. Pressure fluctuations, changes in column temperature, column manufacturing differences, and peptide interactions can cause changes in the elution time, and/or the elution order of peptides [13]. The mass-to-charge ratios are also subject to measurement error, albeit to a lesser degree than the LC dimension. We account for systematic error with a global shift and scale. Such a transformation would most likely be the result of variations in LC protocols (total run times), or in the time it takes for the first peptide to elute from the column (gradient delays due to different tubing volumes). The remaining random error is assumed to be a sum of small variations from many independent sources of variation, and therefore have a Gaussian distribution. In addition, Gaussian distributions are closed under linear transformations, allowing straightforward computation of posterior distributions with the addition of the shift and scale parameters. Measurements assigned to the same mixture component, or latent peptide, by the model are considered to be matched.
Seed peptide matches are determined with identified peptide sequences and charge states. To avoid introducing error with incorrect identifications, outliers with respect to mass-to-charge ratio and retention time are discarded. These matches are used to initialize hyperparameters, and remain matched at all iterations of the MCMC. Mixture component assignments are given a Chinese Restaurant Process prior, allowing the addition of a new latent peptide if no suitable match exists. Our model addresses simple linear de-warping of the data, however, any preferred de-warping method may be applied prior to utilizing our algorithm. We first describe the model for peptide-level alignment, and then describe the extension to include productions.
Peptide-level model
A sample-specific linear shift (η
d
) and scale (β
d
) is used for de-warping. In the formulas that follow, samples or datasets are indexed by d, individual measured peptides within a sample are indexed by i, and latent (theoretical) peptides are indexed by j. As shown in Equation 1, we assume that a measured peptide feature, xd,i, is a shifted and scaled noisy measurement of a “true” peptide feature, . Let cd,i be an indicator variable for the latent peptide assignment of measurement xd,i, taking on values j=1…J where J is unbounded.
(1)
(2)
(3)
Where εd,i are the residuals between the measured values (xd,i) and the shifted and scaled latent values (), having a multivariate normal distribution - Equation 2. Let Σ be the covariance of these residuals, and let each latent peptide be a draw from DPGMM mixture component cd,i=1…J having mean and covariance σ, as shown in Equation 3. Note that we make the simplifying assumption of shared covariance across all latent peptides. This would suggest that the measured values of each latent peptide show the same variation across the entire mass-to-charge ratio, retention time, and drift time range. We acknowledge that this is not likely the case, although we find this assumption works well in practice. Conjugate priors are used for all model parameters as follows:
(4)
(5)
(6)
(7)
(8)
Normal priors are assigned to the shift and scale parameters as shown in Equations 4 and 5. The seed matches, for each peptide feature are averaged to generate a list of “implied-identified peptides”. Robust fit linear regression is performed for each dataset using the “implied-identified peptides” as predictors, and the measured identified peptides as response. The resulting intercept is taken as the mean hyperparameter in the shift prior distribution (a
d
). Similarly, the coefficient is taken as the mean hyperparameter in the scale prior distribution (e
d
). The variance parameters on the shift and scale priors (b
d
and f
d
) are set tightly to the variance of the regression estimate. This allows an optimal solution to be reached as latent peptides are updated and added, while reducing shift and scale identifiability issues. Both the match covariance (Σ), and latent peptide covariance (σ) matrices are given conjugate inverse-Wishart priors as shown in Equations 6 and 8. The residuals of the shifted and scaled identified peptide measurements, and their respective “implied-identified peptides” are used set hyperparameters. The degrees of freedom parameters (h and t) are set to the number of identified matches minus one, and the inverse-scale matrix is set to the sum of squared residuals. The mean of each latent peptide (μ
j
) is given a conjugate normal prior with as shown in Equation 7. The prior mean (λ) is set to the empirical mean of all measured peptide features in all datasets, and the prior covariance (r) is set to the sum of squared differences between this empirical mean and all measured data. We express the likelihood of xd,i as follows:
(9)
Where z1...z
J
indicates all existing latent peptides. We may also integrate out and re-express the likelihood as:
(10)
The prior probability of an observation, xd,i, being assigned to latent peptide component j given all other assignment indicators, c−(d,i), is given in Equation 11. The notation c−(d,i) refers to all component indicators from all features in all datasets, except d,i. Similarly the prior probability of observation xd,i being assigned to a new latent peptide component is shown in Equation 12.
(11)
(12)
Let α be the DPGMM concentration parameter, N the total number of observed peptide features across all samples, and n−(d,i) the number of observed peptide features other than xd,i assigned to latent peptide j. A constraint is imposed such that only one measurement per dataset may be assigned to a given latent peptide feature. The concentration parameter of the Dirichlet Process (α) is set to the number of peptide feature observations across all samples being aligned. Latent peptide feature assignments are updated from their full conditional posterior distributions, as shown in Equation 13.
(13)
The integral above is tractable due to the shared covariance across latent peptide components. Further details and full conditional distributions are available in Additional file 1. Panel A of Figure 4 shows a plate diagram of the peptide level alignment model.
Production model extension
To incorporate production data, we select up to the 50 most intense productions for each peptide feature measurement, xd,i. We then generate a K-dimensional production intensity profile for each xd,i. Each position, , in the product ion intensity profile, yd,i, is computed as:
(14)
where k=1…K, p=1…50, Ω is a 50-dimensional vector of intensities, M is a 50-dimensional vector of product ion mass-to-charge ratios, and B is a K-dimensional vector of product ion profile mass-to-charge ratio bin upper limits. All values in yd,i sum to one. The mass-to-charge ratio ranges, or bins (B
k
), are determined at the initialization of the alignment, such that bin boundaries fall on mass-to-charge ratio “deserts”. See the Parellelization section and Figure 5 for a detailed description of bin boundary determination. In the experiments described here, we set K = 250 to ensure most product ions would be assigned to their own bin in the product ion profile, and to avoid additional computational complexity. Each existing latent peptide feature is given a K-dimensional product ion profile (w
j
for latent peptide feature z
j
). To assess the similarity of a measured product ion profile, yd,i, and a latent production profile, , we introduce a similarity score, ψ, which is computed as the sum of squared differences of the two product ion intensity profiles, and is assumed to have an exponential distribution to encourage distances close to zero, as shown in Equations 15 and 16.
(15)
(16)
We assign a conjugate gamma prior to the rate parameter, as shown in Equation 17. The hyperparameters for profile scores are set to one.
(17)
At each iteration of the MCMC, the product ion profile, w
j
of an existing latent peptide is updated empirically - the product ion profile is set to the average of the measured product profiles assigned to that latent peptide. The latent product ion profile, w0, of a new latent peptide (one that currently does not exist) is a blank profile - a uniform vector of size K with each element having value 1/K. Combining the product ion model with the peptide-level model, we have the following likelihood (Equation 18) and conditional posterior (Equation 19):
(18)
(19)
Further details and full conditional distributions are available in Additional file 1. We explored additional values of K, as well as implementations of different product ion models, the results and discussions of which can also be found in Additional file 1.
Model fitting
Posterior match probabilities
As our primary goal is obtaining a list of matches, we are only interested in maximum a posteriori (MAP) estimates of the parameters. We employ simulated annealing on all parameters after the initial burn-in period of the MCMC. In addition, with the exception of the latent peptide means, the model parameters are being updated with a large number of observations, and will have fairly tight posterior distributions. Our model assumes that the product ion match likelihoods are independent from the peptide-level match likelihoods. New peptide match indicators are sampled from the full conditionals for each measured peptide in a random order. We sample match indicators in accordance with “Algorithm 2” for sampling mixture component indicators in DPGMM from Neal, 2000. We impose a restriction on match assignment such that only one measured peptide per dataset may be assigned to a given latent peptide. All other parameters are sampled from their full conditional distributions. After obtaining MAP estimates for all parameters, we then iteratively re-sample only the component assignment indicators, keeping track of how often each measurement is assigned to each latent peptide. These assignment proportions are used to make final matches.
Estimating the best alignment
Utilizing the assignment proportions from the final “assignment-only” MCMC iterations, we use a greedy algorithm to determine the final alignment. The best match (latent peptide-measurement pair) across the entire alignment is selected, and then the assignment proportions for measurements in each of the remaining datasets are examined for the current latent peptide. For each dataset, the measurement with the maximum assignment proportion is selected. All remaining match probabilities for the assigned measurements and the current latent peptide are set to zero. This process is repeated until no non-zero assignment proportions remain. These assignment probabilities represent the probability that a given measurement arises from a certain latent peptide. To compute the match probability of two measurements from two datasets, we compute the probability that both measurements are assigned to the same latent peptide - the product of the two individual latent peptide assignment probabilities. Users may utilize these match probabilities to examine matches of varying confidence. An illustration of the algorithm steps is shown in Panel C of Figure 4.
Parallelization
In order to make alignment of large datasets tractable, we split the datasets being aligned in the mass-to-charge ratio dimension, and perform separate alignments of each split in parallel. The boundaries of these splits fall only on mass-to-charge ratio “deserts”, which are empirically determined using all datasets being aligned. There exist gaps - also called “forbidden zones” in the mass distribution of all possible tryptic peptides [39]. These gaps have been utilized to improve peak de-noising techniques [40]. We calculate these gaps empirically based on the data in question, and utilize them to split the data for alignment. Such mass-to-charge ratio deserts are shown in Figure 5. When determining these mass-to-charge ratio deserts, we utilize the given matches to determine an approximate shift, scale, and match standard deviation. We then obtain the number of measured peptides in each mass-to-charge ratio bin the size of match standard deviation, and split the datasets at mass-to-charge ratio deserts defined as stretches of five or more empty bins. This ensures that any measured peptide features with the potential for being aligned to the same latent peptide feature (any measurements that should match one another) will be in the same alignment split. The hyperparameters set in the model initialization are shared across all alignment splits.
The boundaries of the product ion profiles are determined in a similar way. Utilizing the product ion annotations of the given matches, we obtain a mass-to-charge ratio match standard deviation of product ions. We then obtain the number of measured product ions in each mass-to-charge ratio bin the size of the match standard deviation, and set the product ion profile boundaries on mass-to-charge ratio deserts defined as stretches of three or more empty bins. The size of the product ion profile boundaries is as close to 1/K of the spanning product ion mass-to-charge ratio range as possible, given the boundaries are set within mass-to-charge ratio deserts.
Data
All data used in this analysis was obtained under MSE and/or HDMSE conditions (SYNAPT HDMS G2, Waters), and subject to Waters ProteinLynx Global SERVER (PLGS) processing. We utilize peptide features that have already been subject to peak detection, de-isotoping, charge state determination, and tentative identification (although not all identifications are utilized for alignment). All samples were separated by 1D nanoscale capillary ultraperformance Liquid Chromatography in a 90-minute gradient using a 5-40% acetonitrile/water (0.1% formic acid in each).
Three HCV cohorts were utilized in the alignment of serum samples from HCV patients to urine samples from OA patients. The first cohort included 47 patients ages 5 to 18 years from a clinical trial for HCV treatment [41]. The two additional HCV cohorts (n = 41,55) were selected from the Duke Hepatology Clinical Research (DHCR) database [42]. The pediatric clinical trial study was approved by the institutional review boards of the participating sites. Written informed consent was provided by all parents or guardians, and written assent was provided by all participants over 12 years of age. All patients present in the DHCR database cohorts, as well as all OA patients, provided written informed consent, and all study procedures were approved by the Duke University Institutional Review Board.
Analysis
All alignments were performed using Matlab on the Duke Shared Computing Resource, a cluster of Intel x86 compute notes running Linux. Each alignment was partitioned into a maximum of 250 splits, and each alignment partition was run on a single node with at least 8GB of memory. Our method does require considerable computation time - approximately 2 to 6 hours for the E. coli Lysate and HCV-OA alignments not utilizing product ions, and approximately 20-24 hours for the E. coli Lysate alignments utilizing product ions. Times vary by the number and size of datasets being aligned.