Artificial neural networks for the prediction of peptide drift time in ion mobility mass spectrometry
© Wang et al. 2010
Received: 9 November 2009
Accepted: 11 April 2010
Published: 11 April 2010
Ion mobility-mass spectrometry (IMMS) is increasingly used in proteomics. IMMS combines the features of ion mobility spectrometry (IMS) and mass spectrometry (MS), separating and detecting peptide ions on a millisecond time-scale. IMS separates peptide ions based on drift time, which is determined by the collision cross-section of each peptide ion under given experimental conditions. A peptide ion's collision cross-section is related to the ion size and shape that result from the peptide amino acid sequence and its modifications. This inherent relation between the drift time of a peptide ion and its sequence indicates that drift time can be used to infer peptide sequence and, therefore, to support peptide identification.
This paper describes an artificial neural network (ANN) regression model for the prediction of peptide ion drift time in IMMS. Each peptide in this work was represented using three descriptors (i.e., molecular weight, sequence length and a two-dimensional sequence index). An ANN predictor consisting of four input nodes, three hidden nodes and one output node was constructed for peptide ion drift time prediction. For model training and testing, a 10-fold cross-validation strategy was applied to three datasets, each containing peptide ions of a different charge state. Dataset one contains 212 singly-charged peptide ions, dataset two has 306 doubly-charged peptide ions, and dataset three has 77 triply-charged peptide ions. Our proposed method achieved 94.4%, 93.6% and 74.2% prediction accuracy for singly-, doubly- and triply-charged peptide ions, respectively.
An ANN-based method has been developed for predicting the drift time of peptide ions in IMMS. The results achieved here demonstrate the effectiveness and efficiency of the prediction model. Combined with current database search approaches, this work can enhance the confidence of protein identification.
Proteomics is the large-scale identification and quantification of all expressed proteins in biological samples. Understanding protein expression, the structure and function of each protein, and the interactions among them will facilitate the search for useful targets and biomarkers for pharmaceutical design. Currently, mass spectrometry (MS) is an indispensable tool for protein identification and quantification [1–6]. The typical procedures in proteomics include digestion of the protein mixture into peptides, peptide separation using multidimensional liquid chromatography (MDLC), and finally MS for quantification and tandem mass spectrometry (MS/MS) for identification of the proteins from which the peptides were derived [7–10].
Ion mobility-mass spectrometry (IMMS), which combines the features of ion mobility spectrometry (IMS) and MS, separates and detects peptide ions on a millisecond time-scale [11–13]. A typical proteomics experimental setup using IMMS consists of five components: sample introduction, compound ionization, ion mobility separation, mass separation, and peptide and protein ion detection. First, the peptide mixture is introduced into the IMMS system. All peptides are ionized by electrospray ionization (ESI). The peptide ions are then passed through a drift tube, where they are separated based on their mobility. The separated peptide ions are further submitted to an MS, where they are separated by mass-to-charge ratio (m/z) and finally detected by a mass detector. Although all five components play essential roles in the process, ion mobility separation is crucial because of its impact on the subsequent mass analysis and peptide ion detection. Ion mobility allows peptide ions to be separated based on differing cross-sections and molecular charge. This makes it possible to discriminate peptides with the same m/z by the difference in their cross-section-to-charge ratio. To achieve high-confidence peptide identification, many researchers have enhanced peptide ion separation by changing the ion mobility conditions, for example by employing different gases, altering electric field strengths, and adopting non-linear electric field gradients [15–21]. Although these changes to the experimental environment can improve the ability of IMMS instruments to separate peptides, they are time-consuming and may be difficult to reproduce with different instrument configurations.
IMS separates peptide ions based on drift time, which is determined by the collision cross-section of each peptide ion under given experimental conditions. A peptide ion's collision cross-section is related to the ion size and shape that result from the peptide amino acid sequence and its chemical modifications. Therefore, peptide ion drift time in IMS is correlated with the peptide amino acid sequence. Deciphering this inherent relation between the drift time of a peptide ion and its sequence will significantly benefit not only the understanding of gas-phase peptide chemistry, but also the identification of peptides and proteins in proteomics.
Peptide selection, fractionation and separation on chromatographic columns may be modelled with various methods. We have developed an algorithm to predict the fractionation of peptides in strong anion exchange (SAX) chromatography using a pattern classification technique based on the artificial neural network (ANN) method [22, 23]. An ANN has also been used to predict peptide separation in reversed phase (RP) chromatography [24, 25]. Predicted peptide retention times have been applied to assist peptide identification. Like other chromatographic methods, IMMS separates peptides based on chemical and physical properties that reflect the peptide amino acid sequence and associated chemical modifications. It has been reported that the measurement of peptide ion drift time using IMMS is very reproducible. Any two measurements of mobilities (or cross-sections) recorded on the same instrument usually agree to within 1% relative uncertainty. Measurements performed by different groups usually agree to within 2%. The high reproducibility of IMS measurements encouraged us to explore the possibility of predicting peptide ion drift time using commercial IMMS instrumentation. The predicted peptide ion drift time can be used to simulate peptide separation in IMMS and to enhance confidence in protein identifications.
In this paper, we propose a computational method to predict peptide ion drift time in IMMS using an artificial neural network (ANN) regression model. To find general properties from which the drift time of peptide ions in IMMS can be estimated, sequence-based information was first extracted from each peptide, including peptide molecular weight, sequence length and a two-dimensional sequence index. The two-dimensional sequence index has two parameters designed to reflect the peptide amino acid sequence information based on the ionization constant (pKa) values of the 20 amino acids. Thereafter, a 10-fold cross-validation strategy was employed for ANN model training and testing using three datasets with different charge state assignments. The developed ANN model was tested on a five-protein digest sample. The high prediction accuracy achieved in this work demonstrated the effectiveness and efficiency of the prediction model.
In this study we used the peptides generated from tryptic digestion of 20 pure proteins for our model development and testing. The raw data obtained from the mass spectrometer were first processed using instrument control software (MassLynx V4.1) to determine the drift time of each peptide ion. Peptide charge state was manually assigned based on the m/z spacing between isotopic peaks. Peptide ions were assigned using a peptide mass fingerprint approach with an assignment threshold of ±0.02 Da. We assigned 595 peptide ions to the 20 proteins. Of these assigned peptide ions, 212 were singly charged, 306 were doubly charged and 77 were triply charged.
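The charge assignment described above was done manually. As a minimal sketch of the underlying rule, adjacent isotopic peaks of an ion with charge z are separated by roughly 1.00335/z in m/z (the 13C to 12C mass difference divided by the charge). The function names and the peak list below are hypothetical and are not part of the original workflow.

```python
# Sketch: infer a peptide ion's charge state from the m/z spacing of its
# isotopic peaks, then convert m/z to a neutral mass.

C13_C12_DELTA = 1.00335  # Da, mass difference between 13C and 12C
PROTON_MASS = 1.007276   # Da


def charge_from_isotope_peaks(peaks_mz):
    """Estimate charge z from a sorted list of isotopic peak m/z values."""
    if len(peaks_mz) < 2:
        raise ValueError("need at least two isotopic peaks")
    spacings = [b - a for a, b in zip(peaks_mz, peaks_mz[1:])]
    mean_spacing = sum(spacings) / len(spacings)
    return round(C13_C12_DELTA / mean_spacing)


def neutral_mass(mz, z):
    """Neutral mass of a protonated [M + zH]z+ ion observed at m/z."""
    return z * (mz - PROTON_MASS)


# Hypothetical isotopic envelope, for illustration only.
peaks = [650.340, 650.842, 651.343]
z = charge_from_isotope_peaks(peaks)
print(z, round(neutral_mass(peaks[0], z), 3))
```

The neutral mass obtained this way is the quantity that would be compared against theoretical peptide masses in a peptide mass fingerprint search.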
Table 1. Experimental datasets with different charge state assignments
Charge state assignment    Number of peptides
Singly charged (+1)        212
Doubly charged (+2)        306
Triply charged (+3)        77
In this study, we developed separate ANN regression models for the singly-, doubly-, and triply-charged peptide ions. During ANN model construction, we employed a 10-fold cross-validation strategy. For each of the three datasets, we first randomly partitioned the entire dataset into 10 subsets of equal size. Of the 10 subsets, a single subset was selected as validation data while the remaining 9 subsets were used to train the ANN model. The peptide drift times of the validation data were then predicted using the trained ANN regression model. This process was repeated so that every subset was used as the validation data exactly once. Therefore, 10 experiments were performed for each charge-specific dataset, and the drift time of each peptide ion in a dataset was predicted exactly once. The advantage of this cross-validation method is that all observations are used for both training and validation, which provides reliable learning of our model from the original data.
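The cross-validation procedure can be sketched as follows. This is an illustrative outline using only the Python standard library; the helper names (k_fold_indices, cross_validate), the mean-value stand-in for the ANN, and the drift times are assumptions made for the example.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Randomly partition range(n_items) into k folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_items, fit, predict, k=10, seed=0):
    """Predict every item exactly once with a model trained on the other folds."""
    folds = k_fold_indices(n_items, k, seed)
    predictions = {}
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_items) if i not in held]
        model = fit(train)              # train on the remaining 9 folds
        for i in held_out:              # predict the held-out fold
            predictions[i] = predict(model, i)
    return predictions


# Toy stand-in for the ANN: predict the mean drift time of the training ions.
drift_times = [3.0 + 0.1 * (i % 11) for i in range(77)]  # hypothetical values, ms
predictions = cross_validate(
    len(drift_times),
    fit=lambda train: sum(drift_times[i] for i in train) / len(train),
    predict=lambda model, i: model,
)
print(len(predictions))
```

With 77 ions, as in the triply-charged dataset, the folds cannot be exactly equal, so this sketch produces folds of 7 or 8 ions.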
For a neural network model, the hidden layer configuration is very important because it introduces a nonlinear relationship into the network and gives the network its ability to generalize. The model created in this paper is a back-propagation neural network with a single hidden layer, because increasing the number of hidden layers did not improve the results. However, choosing the number of nodes in the hidden layer is difficult because there is no accepted theory for this problem. It is generally accepted that adding nodes to the hidden layer enables the model to "learn" more from the training data and gives it more power and flexibility. However, too many hidden nodes increase the risk of over-fitting and of a loss of generalization. Therefore, a balance between the learning ability and the generalization of the model must be found. Given the complexity of the current problem and the number of nodes in the input and output layers, we first established a single-node hidden layer and then increased the number of nodes in this hidden layer by one in each iteration. We then chose an optimal number of hidden layer nodes based on the prediction results.
Table 2. Prediction accuracy as a function of the number of hidden layer nodes under a variation threshold of 15% in three datasets
Hidden layer nodes    C1 (singly charged)    C2 (doubly charged)    C3 (triply charged)
1                     0.771 ± 0.053 a        0.835 ± 0.059          0.712 ± 0.027
2                     0.888 ± 0.023          0.936 ± 0.014          0.758 ± 0.028
3                     0.944 ± 0.045          0.936 ± 0.009          0.742 ± 0.023
4                     0.939 ± 0.023          0.929 ± 0.016          0.697 ± 0.020
5                     0.922 ± 0.013          0.940 ± 0.004          0.742 ± 0.036
a Values are prediction accuracy ± standard deviation.
where η_pred is the prediction variation, t_pred is the predicted peptide ion drift time and t_exp is the experimentally observed peptide ion drift time.
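To make the accuracy measure concrete, the sketch below assumes that the prediction variation is the relative deviation η_pred = |t_pred − t_exp| / t_exp and that a prediction counts as correct when η_pred does not exceed the threshold (15% in Table 2). Both the formula and the drift times used here are assumptions for illustration.

```python
def prediction_variation(t_pred, t_exp):
    """Relative deviation of predicted from observed drift time (assumed form)."""
    return abs(t_pred - t_exp) / t_exp


def prediction_accuracy(predicted, observed, threshold=0.15):
    """Fraction of ions whose prediction variation is within the threshold."""
    hits = sum(
        prediction_variation(p, o) <= threshold
        for p, o in zip(predicted, observed)
    )
    return hits / len(observed)


# Hypothetical drift times in ms, for illustration only.
observed = [4.0, 5.0, 6.0, 8.0]
predicted = [4.2, 5.5, 7.2, 8.0]
print(prediction_accuracy(predicted, observed))
```

Raising the threshold argument shows how accuracy changes with the tolerated variation, which is the kind of comparison made for the triply-charged ions at 15% and 27%.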
Table 2 shows that the standard deviations of prediction accuracy of our ANN regression model range from 0.004 to 0.059 with an average value of 0.026. Such low standard deviations in all of the repeated experiments, regardless of the dataset (C1, C2 and C3) and the number of hidden layer nodes, indicate the stability of our proposed ANN regression model. Table 2 also shows that the ANN regression model performs better when the hidden layer has multiple nodes. The prediction performance of the ANN model reaches a plateau and starts to fluctuate when the hidden layer contains more than 3 nodes. The best prediction accuracy for the singly-charged peptide ions (dataset C1), 0.944, was obtained with three hidden nodes. The best accuracy for the triply-charged peptide ions (dataset C3), 0.758, was obtained with two hidden layer nodes. The best accuracy for the doubly-charged peptide ions (dataset C2), 0.940, was obtained with five hidden layer nodes. However, the improvement from two nodes to five nodes for C2 is not significant, at only about 0.004 in accuracy. It is well known that more hidden layer nodes may result in a higher computational cost and a higher probability of over-fitting. Therefore, we chose a three-node hidden layer as a compromise across the prediction performance in all three datasets.
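The summary figures can be re-derived from the Table 2 values. The sketch below also ranks the hidden-layer sizes by their unweighted mean accuracy over the three datasets; that ranking criterion is an assumption chosen for illustration, since the text does not state a formal selection rule.

```python
# Table 2: hidden nodes -> (accuracy, standard deviation) for C1, C2, C3.
TABLE_2 = {
    1: [(0.771, 0.053), (0.835, 0.059), (0.712, 0.027)],
    2: [(0.888, 0.023), (0.936, 0.014), (0.758, 0.028)],
    3: [(0.944, 0.045), (0.936, 0.009), (0.742, 0.023)],
    4: [(0.939, 0.023), (0.929, 0.016), (0.697, 0.020)],
    5: [(0.922, 0.013), (0.940, 0.004), (0.742, 0.036)],
}


def mean(values):
    return sum(values) / len(values)


# Unweighted mean accuracy across the three datasets for each hidden-layer size.
mean_accuracy = {n: mean([acc for acc, _ in row]) for n, row in TABLE_2.items()}
best_nodes = max(mean_accuracy, key=mean_accuracy.get)

# Average standard deviation over all 15 table cells.
all_sd = [sd for row in TABLE_2.values() for _, sd in row]
mean_sd = mean(all_sd)

print(best_nodes, round(mean_accuracy[best_nodes], 3), round(mean_sd, 3))
```

Under this criterion the three-node hidden layer has the highest mean accuracy, in line with the architecture chosen above.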
The performance of our model on the triply-charged peptide ions is relatively poor. With a 15% prediction variation threshold, the ANN model correctly predicts the drift time of only 74% of the triply-charged peptide ions. The prediction accuracy increases to 95% if the prediction variation threshold is set to 27%. We believe there are three reasons for the poor prediction accuracy for the triply-charged peptide ions. First, the training dataset is not large enough to obtain a reliable model. The training dataset contains only 66 triply-charged peptide ions, compared with 179 singly-charged and 267 doubly-charged peptide ions. Second, the four peptide features extracted from the peptide sequence do not include peptide conformation information. In general, the triply-charged peptides are large peptides. For example, the average molecular weight of the triply-charged peptides in the training data is 2046.30 Da and the average sequence length is 18.4 amino acid residues (Figure 1). Such large peptides usually form secondary structure, which contributes to the peptide's cross-section and therefore affects the peptide ion's drift time. Third, the larger coulomb force experienced by the triply-charged peptide ions may cause a larger range of overall cross-section differences; that is, many more species (notably shorter peptides) may adopt elongated conformations in order to minimize coulomb repulsion. This increased diversity in size distribution further compounds the problem of insufficient training dataset size. Unfortunately, a method for exactly predicting peptide ion conformation has not yet been developed.
Predicted Drift Times of Peptide ions in the Test Dataset Using an ANN Model with Three Hidden Layer Nodes
The testing data were peptide digests of five randomly selected proteins. The tryptic digest of each protein was analyzed separately by IMMS. The drift time prediction performance for the peptide digest of each individual protein indicates how well our ANN regression model performs on different experiments run under the same experimental conditions. The mean differences between the observed and the predicted drift time values of all tryptic peptide ions for the five proteins are 0.168, 0.206, -0.142, -0.039, and -0.091 ms, respectively. These small differences among protein digests, i.e., among different IMMS experiments, indicate that our proposed model is robust for the prediction of peptide ion drift time across repeated experiments.
It is well known that the prediction ability of an ANN depends on the model training. In the model training phase, there are two key elements: the quality of the original data and the architecture of the model. These two facets are closely linked to each other. Because our original data depend on the experimental environment in which they were acquired, this model is applicable to the present experimental conditions. Under new conditions, the model should be retrained and adjusted accordingly.
In this study, an ANN regression model was developed to predict peptide ion drift time for IMMS measurements. To evaluate our proposed model, we tested its performance on a testing dataset that was not used during model construction and training. The similar prediction accuracy on the training dataset and the testing dataset indicates that the predictions of the present model could be used to assist protein identification in proteomics studies. We achieved 94.4% prediction accuracy for +1 peptide ions and 93.6% for +2 ions. This relatively high level of performance indicates the capability of our proposed method. In addition, the simple network architecture, consisting of four input nodes, three hidden nodes and one output node, makes our model more efficient, because the simpler the network architecture is, the faster ANN training and prediction will be. The relatively low prediction accuracy of 74.2% for the +3 peptide ions suggests that the spatial conformations of peptides with higher charge states present an additional level of sample complexity that is not addressed in our current ANN model. Incorporating conformation information, such as secondary structure formation, ion elongation, or interaction between peptide residues, into the present model should improve prediction capability and is the aim of future work. Furthermore, we plan to generalize the ANN model to predict the drift time of peptides with various post-translational modifications.
Twenty proteins were purchased from Sigma Aldrich and used without further purification. A quantity (10 μg to 1 mg) of each protein was dissolved in 500 μL of denaturing solution (6 M urea in 200 mM ammonium bicarbonate, pH ~8.0). To each protein sample, an aliquot of a stock solution of dithiothreitol (DTT, Sigma) was added such that the protein to DTT molar ratio was 1:40. The reduction reaction was allowed to proceed at 37°C for two hours. Next, the samples were cooled on ice and an aliquot from a stock solution of iodoacetamide (IAM, Sigma) in 200 mM ammonium bicarbonate was added such that the protein to IAM ratio was 1:80. The reaction was allowed to proceed in darkness on ice for two hours. Then an aliquot of a cysteine stock solution in 200 mM ammonium bicarbonate was added to each sample such that the protein to cysteine ratio was 1:40. The quenching step was carried out at room temperature for 30 minutes. Next, the sample solution was diluted with 200 mM ammonium bicarbonate buffer solution such that the final urea concentration was 2 M. Finally, an aliquot of a stock solution of trypsin (TPCK-treated, Sigma) in 200 mM ammonium bicarbonate was added to each sample such that trypsin was 2% of the protein content (by weight). The trypsin digestion was allowed to proceed for 24 hours at 37°C. Samples were desalted using solid phase extraction (Oasis HLB cartridges, Waters) and subsequently dried with a centrifugal concentrator (Labconco). Electrosprayed samples consisted of 1 × 10^-4 to 1 × 10^-2 mg of peptide digest dissolved in a water:acetonitrile:acetic acid (49:49:2) solution.
Peptide digest samples were analyzed by direct electrospray into the Synapt HDMS instrument (Waters). Each digest sample was infused through the electrospray needle at a flow rate of 5 μL·min^-1. Dataset acquisition was carried out for a total of 3 minutes per sample. Peptide ions from each digest were separated in the T-wave instrument using a traveling wave height of 8.0 V and a speed of 300 m·s^-1. The drift region of the Synapt was pressurized with 0.468 mbar of nitrogen gas. A total of 200 drift time bins were utilized in the work reported here. The duration of each bin corresponded to a single flight time distribution (0 to 2000 m/z for 250 μs), resulting in a drift time range of 50 ms. Flight time distributions were collected using the "V" reflectron mode of the Synapt instrument. Under these conditions, the typical resolving power and mass accuracy of the instrument were 10000 and 10 ppm, respectively.
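As a quick consistency check on the acquisition settings, the drift time range follows directly from the number of bins and the duration of each bin. The symbols used here are shorthand introduced for this check only.

```latex
t_{\text{range}} = N_{\text{bins}} \times t_{\text{bin}}
                 = 200 \times 250\ \mu\text{s}
                 = 50\,000\ \mu\text{s}
                 = 50\ \text{ms}
```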
Peptide ion assignments were obtained from a peptide mass fingerprint for each tryptic digest. The presence of a single component protein in each sample significantly increases the confidence of peptide ion assignments. The masses for dataset features were obtained based on m/z values and the isotopic spacing for each ion. These values were compared to theoretical peptide ions obtained from in-silico digests of proteins obtained from the Swiss-Prot protein database. Values of ± 0.02 Da were used as peptide ion assignment thresholds.
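The assignment step can be illustrated with a minimal matching routine. The function name, peptide labels and masses below are hypothetical placeholders; in practice the theoretical masses would come from an in-silico digest, and only the ±0.02 Da threshold is taken from the text.

```python
def assign_peptide_ions(observed_masses, theoretical_peptides, tolerance=0.02):
    """Match observed neutral masses (Da) to theoretical peptides within ±tolerance.

    theoretical_peptides maps a peptide label to its theoretical mass (Da).
    Returns a list of (observed_mass, label, mass_error) tuples.
    """
    assignments = []
    for observed in observed_masses:
        for label, theoretical in theoretical_peptides.items():
            error = observed - theoretical
            if abs(error) <= tolerance:
                assignments.append((observed, label, error))
    return assignments


# Hypothetical theoretical masses for two peptides of a single protein digest.
theoretical = {"peptide_A": 1045.530, "peptide_B": 1479.795}
observed = [1045.541, 1479.830, 1300.000]

matches = assign_peptide_ions(observed, theoretical)
print(matches)
```

In this toy example only the first observed mass falls within 0.02 Da of a theoretical peptide; the second misses peptide_B by about 0.035 Da and is left unassigned.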
Many computational approaches have been proposed to analyze and process experimental data generated from MS for proteomics research [22, 32]. Among these techniques, ANN-based methods are good choices for their capability of deriving useful information from complicated or imprecise data without the need of a detailed understanding of the underlying phenomena [23, 24].
The goal of this network is to carry out a desired input-output mapping. The learning process of an ANN adjusts the weights of the connections between nodes of different layers to an optimum set of values for this mapping. Once training is finished, the ANN can be applied to the prediction task.
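A minimal sketch of such weight adjustment by back-propagation is given below for a network with one hidden layer. The tanh hidden activation, linear output, learning rate, initialization and toy target function are all assumptions made for illustration; the text does not specify these details of the model actually used.

```python
import math
import random


def init_network(n_in, n_hidden, seed=0):
    """Small random weights and zero biases for a single-hidden-layer network."""
    rng = random.Random(seed)
    return {
        "w1": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def forward(net, x):
    """Return hidden activations (tanh) and the linear output."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(net["w1"], net["b1"])
    ]
    output = sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"]
    return hidden, output


def train_step(net, x, target, lr=0.05):
    """One back-propagation update on the squared error 0.5 * (output - target)**2."""
    hidden, output = forward(net, x)
    delta_out = output - target
    for j, h in enumerate(hidden):
        # Error propagated to hidden node j, computed before w2[j] is updated.
        delta_hidden = delta_out * net["w2"][j] * (1.0 - h * h)
        net["w2"][j] -= lr * delta_out * h
        for i, xi in enumerate(x):
            net["w1"][j][i] -= lr * delta_hidden * xi
        net["b1"][j] -= lr * delta_hidden
    net["b2"] -= lr * delta_out


def mean_squared_error(net, data):
    return sum((forward(net, x)[1] - t) ** 2 for x, t in data) / len(data)


# Toy data: 4 inputs in [-1, 1] and a made-up target (not real drift times).
rng = random.Random(1)
data = []
for _ in range(20):
    x = [rng.uniform(-1.0, 1.0) for _ in range(4)]
    data.append((x, 0.5 * x[0] + 0.3 * x[1]))

net = init_network(n_in=4, n_hidden=3)
loss_before = mean_squared_error(net, data)
for _ in range(200):
    for x, target in data:
        train_step(net, x, target)
loss_after = mean_squared_error(net, data)
print(loss_before, loss_after)
```

The drop in mean squared error over the epochs is the "learning" referred to above: the weights move toward values that reproduce the desired input-output mapping on the training data.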
where m_i is the ionization constant (pKa) value of the i-th amino acid residue in the peptide, and N is the sequence length of the peptide. The pKa value of each amino acid residue was derived in the same way as in our former work [23, 34].
$$p = [mw,\ sl,\ si_1,\ si_2]$$
where p denotes the target peptide, mw the molecular weight, sl the sequence length, and si_1 and si_2 the two components of the sequence index.
Based on the peptide features we used here, each peptide has been represented by a four-dimensional feature vector consisting of 1-D molecular weight, 1-D sequence length and 2-D sequence index. However, the features are derived from different facets of peptide sequence and have different units, which will bring an imbalanced expression level among peptide features. This results in a variation in contribution of each peptide feature to the predictor performance. Therefore, the feature values must be normalized to equally reflect (as much as possible) the influence of each feature.
where f is the raw value of feature mw, sl, si_1 or si_2, f_normalized denotes the normalized value of this feature, and f_min and f_max are the minimum and maximum values of the corresponding feature category. After normalization, all values of each feature fall within the fixed interval [-1, 1].
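One mapping consistent with these definitions is the linear min-max rescaling f_normalized = 2 (f − f_min) / (f_max − f_min) − 1, which sends f_min to -1 and f_max to 1. The sketch below assumes this form; the function names and feature values are hypothetical.

```python
def min_max_scale(values, lower=-1.0, upper=1.0):
    """Linearly rescale a list of feature values to [lower, upper]."""
    f_min, f_max = min(values), max(values)
    if f_max == f_min:
        raise ValueError("feature is constant; cannot be rescaled")
    span = upper - lower
    return [lower + span * (f - f_min) / (f_max - f_min) for f in values]


def normalize_features(peptides):
    """Normalize each column of a list of (mw, sl, si_1, si_2) feature vectors."""
    columns = [min_max_scale(list(col)) for col in zip(*peptides)]
    return [list(row) for row in zip(*columns)]


# Hypothetical raw feature vectors (mw in Da, sl in residues, si_1, si_2).
raw = [
    (800.0, 7, 0.1, 0.2),
    (1200.0, 10, 0.3, 0.1),
    (2000.0, 18, 0.2, 0.4),
]
for row in normalize_features(raw):
    print([round(v, 3) for v in row])
```

Because f_min and f_max are taken from the data being scaled, values for new peptides outside that range would fall outside [-1, 1] unless the stored training minima and maxima are reused and the result is clipped.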
The ANN regression model used in this study consisted of three layers (i.e., the input layer, the hidden layer and the output layer). Each peptide was encoded as a four-dimensional vector using the three features mentioned above. The input layer of our ANN model consisted of four nodes; the output layer only had one node. Additionally, the hidden layer configuration for the model was determined empirically by the number of nodes in the input and the output layers. We tested the number of hidden nodes from 1 to 5, and chose an optimized value by which our model achieved the best performance.
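To give a sense of model size relative to the data, the sketch below counts the adjustable parameters of a fully connected network with four inputs, one output and 1 to 5 hidden nodes. It assumes one bias term per hidden and output node, which the text does not state, and uses the training-set sizes quoted for the three charge states.

```python
def parameter_count(n_inputs, n_hidden, n_outputs=1):
    """Weights plus biases of a fully connected single-hidden-layer network."""
    hidden_params = (n_inputs + 1) * n_hidden   # weights + one bias per hidden node
    output_params = (n_hidden + 1) * n_outputs  # weights + one bias per output node
    return hidden_params + output_params


# Training-set sizes quoted in the text for the three charge states.
TRAINING_IONS = {"+1": 179, "+2": 267, "+3": 66}

for n_hidden in range(1, 6):
    n_params = parameter_count(4, n_hidden)
    # Training ions available per adjustable parameter, per charge state.
    ratios = {z: round(n / n_params, 1) for z, n in TRAINING_IONS.items()}
    print(n_hidden, n_params, ratios)
```

Under these assumptions the chosen 4-3-1 network has 19 parameters, so the triply-charged training set provides only a few ions per parameter, which is consistent with the concern about limited training data for that charge state.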
This work was supported by the NIH (1R41RR024306-01 and 1U24 CA126480-01) and the National Science Foundation of China (No.60803107). Additionally, this work was partially supported by the National Center for Glycomics and Glycoproteomics (NCGG) under grant number RR018942 through the National Institute of Health (NIH) within the National Center for Research Resources (NCRR).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.