Prediction of peptide drift time in ion mobility mass spectrometry from sequence-based features

Background Ion mobility-mass spectrometry (IMMS), an analytical technique which combines the features of ion mobility spectrometry (IMS) and mass spectrometry (MS), can rapidly separates ions on a millisecond time-scale. IMMS becomes a powerful tool to analyzing complex mixtures, especially for the analysis of peptides in proteomics. The high-throughput nature of this technique provides a challenge for the identification of peptides in complex biological samples. As an important parameter, peptide drift time can be used for enhancing downstream data analysis in IMMS-based proteomics. Results In this paper, a model is presented based on least square support vectors regression (LS-SVR) method to predict peptide ion drift time in IMMS from the sequence-based features of peptide. Four descriptors were extracted from peptide sequence to represent peptide ions by a 34-component vector. The parameters of LS-SVR were selected by a grid searching strategy, and a 10-fold cross-validation approach was employed for the model training and testing. Our proposed method was tested on three datasets with different charge states. The high prediction performance achieve demonstrate the effectiveness and efficiency of the prediction model. Conclusions Our proposed LS-SVR model can predict peptide drift time from sequence information in relative high prediction accuracy by a test on a dataset of 595 peptides. This work can enhance the confidence of protein identification by combining with current protein searching techniques.


Background
Ion mobility spectrometry (IMS) has gained significant attentions over the past few decades for rapid, highresolution separations power, which can separate ions on a millisecond time-scale [1][2][3]. As a separation technique which based on differences in size and shape of analytes, IMS has proven powerful in the fields of metabolomics, glycomics and proteomics [1,2]. Ion mobililtymass spectrometry (IMMS), an analytical technique by which IMS coupled with mass spectrometry (MS), have emerged as powerful tools for analyzing biological mixtures, especially for current proteomics studies [4][5][6][7]. By combination of the advantages of IMS and MS, IMMS opens up avenues for the detailed structural analysis of large and heterogeneous protein complexes, providing information on the stoichiometry, topology and cross section of their composition [8,9].
A typical proteomics experimental setup using IMMS consists of five components: sample introduction, compound ionization, ion mobility separation, mass separation as well as peptide and protein ion detection [10]. Although these five components all play essential roles in the process, ion mobility separation is crucial for its impact on the consequent mass analysis and peptide ion detection [11]. Ion mobility separation, by which the peptide ions with different cross-sections and molecular charges will be separated, adds a new dimension of separation and makes IMMS an attractive method for analyzing complex proteomics samples. Peptide ion separation can be enhanced by changing different gases, altering electric field strengths, and adopting non-linear electric field gradients, by which peptide identification can be facilitated to achieve high confidence [12]. Even though these efforts improve the separation capability of IMMS, they are still timeconsuming, and it is difficult to reproduce under different experimental conditions.
Although IMMS separates peptide ions based on differing cross-sections and molecular charge, the experimental measurement behaves in the way that peptides spend different time through the drift tube. It has been reported that the measurement of peptide ion drift time using IMMS is very reproducible [13][14][15][16][17][18]. Any two measurements of mobilities (or cross sections) recorded on the same instrument usually agree to within 1% relative uncertainty. Measurements performed by different groups usually agree to within 2%. As a characteristic of different ions, peptide ion drift time can be used to enhance confidence in protein identifications.
There are several efforts which attempt to computationally determine the mobile behaviour of peptide ions in IMS. Valentine et al. predict peptide ion cross sections using intrinsic size parameters (ISPs) and tested it on 271 singly-charged peptides [19]. A quantitative structureproperty relationship (QSPR) based approach was proposed for prediction of peptide drift time by Liu et al. and found the structure effect and the charge states of peptide ion contribute a lot to the drift time [20]. Shah et al. employed partial least squares (PLS) and support vector regression (SVR) based approaches to predict the drift time of massive peptide ions with different charge states and demonstrated both techniques significantly outperform the ISPs based calculation by a test on a high confidence database of 8,675 peptide sequences [21]. Zhang et al. presented a quantitative structure-spectrum relationship (QSSR) study to predict peptide drift time and found the sequence-based approach can get better fitting ability and predictive power but worse interpretability than the structure-based approach [22]. Our previous works also attempted to address the same problem by employing artificial neural networks and multiply linear regression models [23][24][25]. Although these studies contributed the drift time prediction of peptide ions a lot, ISP based calculations did not show the high performance in peptides with high charged states, and structure-based methods have to construct and optimize the geometrical structures of peptides which will bring inevitable errors into prediction models.
In this paper, a least square-support vectors regression (LS-SVR) model is presented to predict peptide ion drift time in IMMS just from the sequence-based features of peptide. The sequence pattern of each peptide was represented as a 36-component vector, which was consisted of for descriptors, i.e., molecular weight, sequence length, amino acid composition and pseudo amino acid composition. In construction of the LS-SVR regression, a 10-fold cross-validation strategy was employed to determine the optimized values of the regression parameters. Our proposed LS-SVR method was applied into three peptide ions datasets with different charge states, i.e., +1, +2, +3.

Results and discussion
In this work, all the raw data generated from the IMMS were processed using MassLynx V4.1, an instrument control software, to obtain the drift time for each peptide ion peak. MassLynx is a powerful software for analyzing and processing the data acquired from mass spectrometers which are developed Waters Corporation. The peptides generated from tryptic digestion of 20 pure proteins were used for our model development and testing in this study. Peptide charge status was manually assigned based on the m/z spacing between isotopic peaks. As a result, the total of 595 peptides assigned ions which came from the 20 proteins became the dataset for this work. Within this dataset, 212 peptides were singly charged, 306 were doubly and 77 were triply charged. More details can be found in our previous work [12,26].
IMS separate ions based on the fact ions with different shapes and charge states travel though the drift tube at different velocities. In the drift tube, the ions were pulled by a weak electric field and opposed by the inset buffer gas. The charge state is a very important factor for the drift time. Therefore, we developed the SVR models for singly-, doubly-and triply-charged peptides, respectively. In this work we denotes dataset of singly-charged peptides as DataS, doubly-charged peptides as DataD, and triply-charged as DataT. Table 1 shows the distributions of peptide molecular weight, sequence length and drift time in each of the three datasets. It can be seen that the smallest peptide just formed by 3 amino acids with singly-charge state, and the largest one have 34 amino acids from DataD and DataT, which indicate that peptides with large molecular weight and long amino acid sequences, tend to have high charge states. The peptide ion drift time is also significantly related to the overall ion charge state. The mean value of peptide drift time for the singly-charged peptides is 7.48 s while that of the doubly-charged and the triplycharged peptides are 3.07 s and 2.28 s, respectively. The peptides with high charge states drift through the cell in a relative high velocity. Another reason is the higher charge states the peptide is, the higher probability that they form a 3-dimensional spatial structure will be.

Prediction performance evaluation
In this study, we developed the LS-SVR models for predicting peptide drift time for the singly-, doubly-, and triply-charged peptides, respectively. A 10-fold crossvalidation strategy was employed in the training and test process of the regression models, by which all observations in each datasets are used for both training and validation. This cross-validation can provide reliable learning of our model from the original data. The purpose of this work is to predict ion drift time of peptides by elucidating the relationship between the dependent variable, i.e., peptide drift time, and the sequence-based peptide features we used, i.e., peptide molecular weight, sequence length, AAC and PseAAC. For regression analysis, there are many criteria by which they can be evaluated and compared. The root mean square error (RMSE) and coefficient of determination (R 2 ) are selected in this work to evaluate the predictive performance of our LS-SVR models.
where n is the number of peptide in the dataset, dt is the experimentally observed peptide ion drift time, dt the predicted drift time by LS-SVR models, dt is the overall average value of peptide drift time. R 2 takes any value between 0 and 1, with a value closer to 1 indicating the regression model is of better performance.
Furthermore, in order to assess the prediction accuracy of LS-SVR models, a prediction variation threshold, h t , was defined by the relative variation of the predicted drift time from the experimentally observed values. If the relative variation between observed and predicted drift time is smaller than h t , the prediction will be seen as reliable, otherwise, unreliable.
Where h is the prediction variation, dt' is the predicted peptide ion drift time and dt is the experimentally observed peptide ion drift time.

Parameters selection
As what state in Methods part, LS-SVR models with Gaussian kernel was adopted to predict peptides drift time. There are two important parameters for this kind of regression model, i.e., the width of Gaussian kernel parameter s, and the regularization factor g. The correct setting of these two parameters of the LS-SVR models is of critical importance in enabling us to achieve good regression performances. In this work, the grid-searching scheme is used to determine these two parameters based on cross validation strategy. Specifically, the s 2 and g were tuned simultaneously in a grid ranging from 2 -5 , 2 -4 , ..., 2 15 for s 2 and from 2 -5 , 2 -4 , ..., 2 9 for g. The prediction accuracy of LS-SVR models for each peptide dataset was seen as the objective function to determine the optimum combination of s 2 and g, where the value of h t was set as 0.15.
The accuracy curves for different combination of the s 2 and g in the three peptide datasets were shown in the Figure 1. It can be seen that the regression performance of LS-SVR models are heavily depend on the selection of the parameters s 2 and g. When g is fixed, the prediction accuracy goes up with the increase of s 2 to an apex and then goes down.  9 ]. Overall the three datasets, the value [2 11 , 2 8 ] can achieve the best prediction accuracy for the LS-SVR models when h t = 0.15. Therefore, the s 2 of 2 11 and g of 2 8 were selected for the subsequent analysis in this work.

Prediction performance
A 10-fold cross validation was implemented in the construction of LS-SVR models, by which the different separation of the original dataset will bring the changes of predicted drift time for each peptide. For evaluating the uncertainty in the regression performance of our model which come from the randomness of the dataset separations, the regression procedure was repeated for ten times. The mean of the prediction drift times for each peptide from these ten times experiments were taken as the finally predicted value. Also the variation of the ten times was studied to exam the stability of our proposed LS-SVR models.
The prediction performance was shown in Table 2. It can be seen that our models ca achieved very good prediction ability for different peptide dataset, i.e., 0.9811 for DataS, 0.9379 for DataD, and 0.8312 for DataT.
Comparing to DataS and DataD, the prediction accuracy of the triply-charge peptide ions in DataT is a little bit poor. One reason for this situation is that the dataset's size is small, i.e., 77 peptide in DataT, which can not provide sufficient information in the model training. Another reason, we believe, is that the charge state of DataT is higher than that of DataS and DataD, which usually cause the peptide longer. The mean length of peptides in DataT is 18.3, which is 1.4 times of that in DataD, and 2.3 times in DataS. The longer of the peptide length is, the more chance the peptide form the secondary structure will be. Obviously, the changes in space conformation will contribute the peptide's velocity in drift cell and therefore, affect the peptide ion's drift time.
It can be found from Table 2 that the prediction accuracy from the mean of the predicted drift times is better than the mean accuracy of the ten repeat experiments. It can get 0.0075, 0.0039 and 0.0479 for DataS, DataD, and DataT, respectively, which indicated that the combination regression model will improve the predictive power of predictors. From Table 2, it can also be seen that the standard deviation of the prediction accuracy of the ten repeat experiments is very small, i.e., 0.081, 0.061 and 0.025 for the three datasets. It demonstrate our LS-SVR models are stable and statistically valid because a small change in the data, such as the different split of the training and test dataset, may lead to large changes of the prediction performance.
The relative small RMSE and R 2 shown in Table 2    After the LS-SVR models had finished the regression analysis for the three datasets with different charge states ions, the variation threshold h t will decide which peptide can be predicted correctly. Figure 3 displays the relation between the fraction of peptide ions whose drift time are predicted correctly and the accuracy threshold h t . It can be seen that our proposed method can get best prediction performance in the DataS. The reason we believe is the peptides in DataS are small and have higher probability they adopt elongated conformations in order to minimize coulomb repulsion, while the peptides in DataT usually are large and have higher probability to form secondary structure when they go through the drift cell in IMMS instrument. It can be found even the variation threshold is set as 0.10, there are more than 90% peptides can be predicted correctly, by which the prediction performance of our LS-SVR model can be demonstrated. If the conformation information can be added into the regression model, the predictive power for doubly-and triply-charge peptides will be increased undoubtedly.

Conclusions
To enhance the confidence of peptide identification, a LS-SVR model was developed in this study to predict peptide ion drift time for IMMS measurements. In LS-SVR, there are two parameters, i.e., the width of Gaussian kernel parameter s, and the regularization factor g, have to be selected for their influence on the regression accuracy. A grid searching strategy was employed to optimize the selection of these two parameters. Based on the peptide sequence, a 34-component vector was extracted as representation to construct our LS-SVR models on three peptide ion datasets with different charge states. With the prediction accuracy threshold h was set to 0.15, we achieved very high performance, i.e., 0.9811 and 0.9379, for the peptide ions with singly-and doubly-charge, which indicated the prediction capability of the LS-SVR models. It is reasonable that there is a relative lower prediction accuracy of 0.8312 for DataT, for the peptides with higher charge states have a higher probability that they can form a secondary structure. This kind of situation will be improved if the structure information can be added into our proposed LS-SVR models; even more computational cost will be requested.

Peptide dataset
The total of 595 peptides of 20 pure proteins used in this work was reported in our previous work [12]. The proteins were purchased from Sigma Aldrich and used without further purification. The peptide fragments were produced from the pure proteins according to the details of the sample preparation section in the report, and then were analyzed by direct electrospray into the Synapt HDMS instrument (Waters). Peptide ion assignments were obtained from a peptide mass fingerprint for each tryptic digest. As a result, in the dataset with 595 peptide ions, there are 212 peptides were singly charged, 306 were doubly charged and 77 were triply charged. More details about the experimental processing of samples can be obtained from the work [12,26].

Support vector regression
Support vector machines, a specific class of machine learning algorithms which was firstly proposed by Vapnik and his co-workers in 1995 [12], have proven very effective for solving pattern classification problems, even for the dataset in small size. For a binary classification problem, the main idea of SVM is to select a hyper-plane that separates the positive from negative samples while maximizing the minimum margin. Currently, SVM has been became one of the most popular machine learning methods, which has been applied to various domains of interest, such as bioinformatics, cheminformatics, image processing, data mining, knowledge discovery, and etc. In many applications, SVM can achieve excellent performance for the character that the capacity of the SVM system is controlled by parameters that do not depend on the dimensionality of feature space [27][28][29][30][31][32].
In the same way as with classification task, SVM can also be applied to the case of regression which is called support vector regression (SVR). In statistics, regression analysis is a statistical technique for estimating the relationships among variables. All the regression tasks can be formulated as to seek an estimation function which can approximate the observations within an acceptable error range. In this study, least square support vector regression (LS-SVR), a version of SVR which can reduce the complexity of optimization processes, was adopted for the drift time prediction [33].
Given a training dataset D = {x i , y i }(i = 1, 2, ..., n), x R n , y R, where x i is the input vector, y i is its corresponding target vector and n is the size of the dataset, SVR can construct regression model by using nonlinear mapping function j(·) as follows: where w is the vector of coefficients and b a constant. Usually, w and b are obtained by minimizing the upper bound of generalization error. Accordingly, the regression problem in LS-SVR can be transformed into the following optimization problem [34]: where g is the regularization parameter, is applied to control the minimization of estimation error and the function smoothness, and e i is the error between actual output and predictive output of the i -th input data. The high value of g denotes the good fitting of the training data points is stressed, and in the case of noisy data a smaller g value should be taken to avoid overfitting. In order to solve the optimization problem, the Lagrangian function is formulated as following: where a = (a 1 , a 2 , ..., a l ) is the Lagrange multiplier. The KKT conditions are used for optimality by differentiating L with the variable w,b,e,a, which is shown as follows.
By solving the upper linear system, the final solution of the primal problem can be represented in the following form.
where K(•) is kernel function which can satisfy Mercer's condition corresponds to a dot product ion some feature spaces [34]. The most used kernel functions include the Gaussian RBF K(x, x i ) = exp(||x − x i || / 2s 2 ) with a width of s, sigmoid and the polynomial kernel K (x, x i ) = (a 1 xx i +a 2 ) d with an order of and constants a 1 and a 2 . Gaussian RBF kernel is employed in this study, and the kernel parameter s2 and g, therefore, should be determined firstly. Currently, many approaches have been applied in parameter optimization of SVR, such as experience [27], grid searching [35], particle swarm optimization(PSO) [36], genetic algorithm(GA) [37], simulated annealing algorithm [38]. Considering computing complexity, cross-validation grid searching, the most used method, is selected to determine the parameters s2 and g in LSSVR model.

Peptide representation
To implement LS-SVR model to predict peptide drift time in IMMS, each peptide have be represented as a vector with specific peptide features. Because each peptide is not consistent in the length, and the shape is affected by the charge state of the peptides, only features were extracted from the peptide sequence, therefore, are used to represent the peptide in this work.

Peptide molecular weight
In IMMS, the ions are pulled by a uniform electric field through the buffer gas in the drift cell. Therefore, the molecular weight of peptide is one of the most important parameters which can affect ion mobility. Karasek et al. found there is a linear relationship between the reduced mobility of alkylamines and molecular weight under a specific experimental setting [39]. Also, other researches reported that the reduced mobility is inversely proportional to ion mass [40]. For a peptide P whose sequence is consisted of N amino acid residues as follows: Where R i denote the i -th amino acid in the peptide. The molecular weight of P can be calculated as: where mw i is the molecular weight of i -th amino acid in the peptide sequence.

Sequence length
The sequence length (SL) of peptide, N, plays an important role in the formation of peptide's structure. The longer of the peptide sequence is, the more chance the peptide folds into a secondary or tertiary structure. Except charge states, IMS distinguishes ions based on the ion shapes which is affected by the sequence length. The previous work indicated that peptides only with primary structure will have smaller ion mobility than that with secondary structure, and smaller more than that with tertiary structure.

Amino acid composition
All the peptide information is contained in its complete amino acid sequence. Therefore, it is the best choice for representing each peptide by its complete sequence. Amino acid composition (AAC) is one of the popular approaches to address protein or peptide representation problem because it is simple, yet powerful feature in prediction of protein structure, interaction, and functional sites. Generally, there are only twenty standard amino acid residues are considered in AAC. Therefore, AAC is a 20-components vector, where each component shows the occurrence number of an amino acid type in the peptide sequence (in many works, ACC is expressed by the occurrence frequencies, not numbers). For peptide P, ACC can be expressed by Where a i denotes the normalized frequency of i -th type of amino acid in peptide P.

Pseudo-amino acid composition
Though AAC can represent peptides in a very simply way, it ignores all the information of amino acid sequence-order effects, which decide the local environment of each amino acid in the peptide. Therefore, Pseudo amino acid composition (PseAAC) was originally introduced by Kuo-Chen for representing proteins and had demonstrated its effectiveness in improving protein subcellular localization prediction, membrane protein type prediction and other works [41]. For peptide P, PseAAC could be formulated as PseAAC P = (p 1 , p 2 , · · · , p 20 , p 20+1 , · · · , p 20+λ ) T , (λ < N) (12) Where p 1 , p 2 , ..., p 20 , are associated with the conventional amino acid composition of P, which already represented by sequence length and ACC in above, and p 20+1 , p 20+2 , · · · , p 20+λ are the l correlation factors that reflect the 1st tier, 2nd tier, ..., and the l-th tier sequence order correlation patterns. Therefore, only p 20+1 , p 20+2 , · · · , p 20+λ in PseAAC P have been adopted for representing peptides. In this work, six characters of 20 amino acid, i.e., hydrophobicity, hydophilicity, mass, pK1(alpha-COOH), pK2(NH3) and pI(at 25°), have been used for calculated PseAAC P , and l is set up to 2.

Feature normalization
From the above section, it can be found that four types of sequence-based features were applied to represent peptides. However, these four features are of different physical dimension of quantity and different value ranges. The imbalanced expression level of different features will result in a variation in contribution of each of them to the drift time predictor. To remove the bias of expression level, all of the feature values have to be normalized to equally reflect (as much as possible) the influence of each feature. In this work, all values of each feature always fall within a fixed interval [-1, 1] by where f is the raw value of feature, f normalized denotes the normalized value of this feature, f min and f max are the minimum and maximum values of the corresponding feature category.

Regression model construction
In our experiment, regression predictor is designed using LS-SVR model to solve drift time prediction from peptide sequence-based features. Based on the description of peptide representation, the LS-SVR model for predicting peptide drift time are constructed on a vector consisted of four sequence-based features, of which MW is of with 1 dimension, SL with 1 dimension and AAC with 20 dimensions. For PseAAC, the dimension is 12 for we employed 2-tier sequence correlation pattern with 6 amino acid characters. As a result, each peptide is represented in the predictor by a 34-component vector. For the peptide datasets, i.e., DataS, DataD and DataT, we construct three LS-SVR model for each dataset because the determinative effect of charge state to ion mobility.

Cross-validation
To evaluate the prediction performance of each regression model, a 10-fold cross-validation strategy was adopted for regression analysis. Specifically for singly-charged peptides, DataS is randomly partitioned into 10 sub-datasets, of which a single sub-dataset is retained as the validation data for testing the model, and the remaining 9 sub-datasets are used as training data. After training processes were finished, the LS-SVR model can be applied to the prediction task. This process is then repeated 9 times with each of the ten sub-datasets used exactly once as the testing data. The 10 results from the folds are combined to evaluate the prediction performance.