A robust linear regression based algorithm for automated evaluation of peptide identifications from shotgun proteomics by use of reversed-phase liquid chromatography retention time

Background: Rejection of false positive peptide matches in database searches of shotgun proteomic experimental data is highly desirable. Several methods have been developed to use the peptide retention time to refine and improve peptide identifications from database search algorithms. This report describes the implementation of an automated approach to reduce false positives and validate peptide matches.
Results: A robust linear regression based algorithm was developed to automate the evaluation of peptide identifications obtained from shotgun proteomic experiments. The algorithm scores peptides based on their predicted and observed reversed-phase liquid chromatography retention times. The robust algorithm does not require internal or external peptide standards to train or calibrate the linear regression model used for peptide retention time prediction. The algorithm is generic and can be incorporated into any database search program to perform automated evaluation of the candidate peptide matches based on their retention times. It provides a statistical score for each peptide match based on its retention time.
Conclusion: Analysis of peptide matches where the retention time score was included resulted in a significant reduction of false positive matches with little effect on the number of true positives. Overall higher sensitivities and specificities were achieved for database searches carried out with MassMatrix, Mascot and X!Tandem after implementation of the retention time based score algorithm.


Background
The science of proteomics encapsulates the large-scale identification, characterization and quantitation of proteins from biological samples. Mass spectrometry (MS) has been recognized as a powerful technique to study proteins. High-performance liquid chromatography (HPLC) coupled with tandem mass spectrometry (LC-MS/MS) is most commonly used in shotgun proteomics to resolve and identify proteolytic peptides generated from complex protein mixtures [1]. Peptide and protein identifications are usually derived from information contained in the tandem MS data. Automated database searching and de novo sequencing algorithms are routinely used to convert the MS/MS data into peptide and protein identifications [2]. Database search algorithms are more commonly used at this time due to their relatively low computational expense and higher compatibility with low mass accuracy and low quality MS/MS data [3,4].
It has also been recognized that the LC retention times of peptides are related to their sequences and can be used as complementary information for their identification and characterization [5]. Several methods have been developed to predict peptide retention times in reversed-phase liquid chromatography (RPLC) based on amino acid compositions and/or sequences [6-22]. High correlation between observed and predicted retention times for peptides in RPLC under different conditions has been achieved by use of these methods. Furthermore, these approaches can be combined with mass spectrometry to achieve better confidence in peptide identification than MS alone. For example, accurate mass tags combined with peptide retention time prediction have been effectively used by several groups to improve proteome characterization [23-26].
Peptide retention time prediction can also be used to refine and improve peptide identifications resulting from analysis of LC-MS/MS by database search software. In this way, false peptide matches from database search results can be minimized and true peptide matches can be confirmed with higher confidence. Krokhin et al reported an algorithm to refine the results obtained from the Global Proteome Machine [27] database searches [28]. In their approach, either internal or external standard peptides were used to estimate the regression parameters for the linear retention time prediction model used to refine their results. Strittmatter et al reported a post-database search method that evaluated peptide matches from SEQUEST [29] based upon their retention times. They reported a peptide retention time prediction model based on artificial neural networks [15]. The prediction model was calibrated with highly reliable peptide matches from SEQUEST for each LC-MS/MS analysis [17]. An empirical discriminant score based on retention time and SEQUEST scores was also developed for peptide matches. It was shown that the number of reliable peptide matches was increased by use of peptide retention time information [17]. Klammer et al also developed an algorithm based on support vector regression to improve peptide identifications in tandem mass spectrometry by use of retention time prediction. As much as 50% of false peptide identifications in database search results from SEQUEST could be filtered while only 3% of true peptide matches were lost. The algorithm also trains the linear regression model dynamically for each data set [30].
We recently developed a robust linear regression based algorithm for automated evaluation of peptide identifications from database search programs based on retention time in RPLC. The algorithm extends the retention time prediction algorithm and its use for peptide identification in off-line LC-MS/MS by Krokhin et al [16,28]. The algorithm described here works for on-line LC-MS/MS experiments and eliminates the need for retention time prediction model calibration by use of internal or external standard peptides. The algorithm is generic and can be used to evaluate peptide matches from any database search program. It has been included in a database search program, MassMatrix [31], to perform automated data analysis. A post-hoc retention time analysis program, LR_RT, was also developed to analyze search results from other publicly available programs, such as Mascot and X!Tandem. Furthermore, a score algorithm was developed to provide a statistical score for each peptide match based on its predicted and observed retention times.

Sample preparation and mass spectrometry
Bovine histones were isolated from bovine thymus tissue as described by Sures et al [32,33]. The bovine histone mixture was digested by use of trypsin in 100 mM ammonium bicarbonate buffer (pH = 8.0). The enzyme was used at a 25:1 ratio (substrate:enzyme) and the mixture was incubated at 37°C for two hours. The digested peptides were identified by use of data-dependent nano-LC-MS/MS on an LCQ Deca XP ion trap mass spectrometer (ThermoFisher, San Jose, CA, USA) as reported previously by Su et al [34]. In brief, 2.0 μL of bovine histone peptides at a total concentration of 0.1 μg/μL was injected and eluted off the capillary HPLC column (5 cm × 75 μm Pico Frit C18 column, 300 Å pore size, New Objective, Woburn, MA) into the LCQ mass spectrometer at a flow rate of 250 nL/min. Mobile phases A and B were water with 0.1% acetic acid and acetonitrile with 0.1% acetic acid respectively. A linear gradient of 5-50% of mobile phase B over 35 minutes was used. The total run time was 70 minutes.

Database Search and Search Parameters
The .RAW data files obtained from the mass spectrometer were converted to mzXML files by use of ReAdW (http://tools.proteomecenter.org/wiki/index.php?title=Software:ReAdW). Tandem mass spectra that were not derived from singly charged precursor ions were considered as both doubly and triply charged precursors. The mzXML file was searched by use of MassMatrix (http://www.massmatrix.net) against a database that contained both the bovine histone database and a reversed NCBInr human protein database as a decoy database. The search options were set as follows: i) No variable or fixed modifications; ii) Enzyme: trypsin; iii) Missed Cleavages = 3; iv) Peptide Length = 6 to 30 amino acid residues; and v) Mass tolerances of 2.0 Da and 0.8 Da for the precursor and product ions respectively. The data set was also evaluated by Mascot [35] and X!Tandem [27]. The search parameters were identical to those in MassMatrix. The search results from Mascot and X!Tandem were then analyzed by the post-hoc retention time analysis program, LR_RT (http://www.massmatrix.net), to obtain retention time based scores for each peptide match from the two search programs.

Results and discussion
Algorithm development
Overview
Figure 1 shows the flow diagram for an algorithm that evaluates peptide matches based on their observed and predicted reversed-phase liquid chromatography retention times. The goal is to improve the confidence in peptide identification and lower the number of false positive matches returned by database search programs. In brief, the algorithm first creates a training data set from the peptide matches with high statistical scores. These training data are then fitted to a robust linear regression (robust LR) model. Outliers due to false peptide matches in the training data set are removed by use of a recursive outlier-removal algorithm. The training data with outliers removed are then fitted to a linear regression (LR) model. A score for each peptide match is then calculated.
A key advantage of this algorithm is that the LR model can be trained independently for each search. Thus there is no need to train or calibrate the LR model with internal or external standards for a given batch of samples. Furthermore, the algorithm is generic and can be used to evaluate peptide matches from any database search program. The algorithm can use different linear regression models for predicting peptide retention times under a variety of chromatographic conditions. For analysis of shotgun proteomic data sets, the linear regression model for peptide retention time prediction developed by Krokhin et al [16] was used. The model can be used to accurately predict retention times of tryptic peptides on reversed-phase (300 Å pore size) HPLC columns of various sizes with linear water-acetonitrile gradients containing trifluoroacetic acid, acetic acid, or formic acid as the ion-pairing agent [16]. The detailed implementation and performance of the algorithm are described in the next sections.

Linear regression model for predicting peptide retention times in RPLC
Retention times for true peptide matches identified by database search programs were assumed to follow a linear regression (LR) model [16]:

$$T = a + bH + \varepsilon \qquad (1)$$

where T is the retention time of the peptide match, H is the calculated hydrophobicity of the peptide match, a and b are parameters for the linear model that depend on the RPLC column and elution gradient, and ε is the residual of the model. The hydrophobicity for a peptide is calculated from the peptide sequence by use of the model developed by Krokhin et al [16]. Their model was derived from the work of Guo et al [10,11], in which the hydrophobicity of a peptide is calculated by eqn. 2 [16], where $K_L$ is the length correction coefficient of the peptide, N is the length of the peptide in terms of the number of amino acid residues of the peptide, and $R^1_{cNt}$, $R^2_{cNt}$, and $R^3_{cNt}$ are the $R_{cNt}$ values for the first, second, and third amino acid residues from the N-terminus of the peptide respectively. The length correction coefficient $K_L$ is calculated by eqn. 3.
The $R_C$ and $R_{cNt}$ values for the 20 common amino acid residues were reported previously by Krokhin et al [16].
Figure 1. Flow diagram of the robust LR based algorithm for automated evaluation of peptide matches from a database search engine by their retention times.

Selection of training data
The retention time of the peptide match for each tandem mass spectrum was obtained from the mzXML or mzData file. We assume that retention times for true peptide matches will follow the LR model described in eqn. 1. The parameters of the LR model for true peptide matches are estimated from a training data set, which are then used to evaluate all peptide matches in the search result. In order to eliminate the need for LR model training by internal or external protein or peptide standards, the algorithm creates a training data set directly from the current database search results. This training data contains a selected number of peptide matches from the database search program with scores above a specified threshold. The accuracy and reliability of the LR model directly rely on the quality of this training data set. There are two major factors that affect model training: 1) size of the training data set and 2) false peptide matches included in the training data set that do not follow the linear regression model of the true matches. These false peptide matches should appear as outliers in our LR model and are referred to as outlier peptide matches. These outlier peptide matches have a negative impact on the calculation of parameters in the LR model. Increasing the threshold for statistical scores contained in the training data can minimize outlier peptide matches but will also significantly reduce the size of the training data set. By setting up moderate thresholds, a typical database search can generate training data sets containing 100 to 500 peptide matches. One challenge of this approach is that outlier peptide matches may be retained within the training data, especially for searches with large databases obtained at low mass accuracy.
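The selection step described above can be illustrated with a minimal Python sketch. The `PeptideMatch` record, its field names, and the `higher_is_better` flag are hypothetical and are introduced only for illustration; they are not part of MassMatrix or LR_RT. The flag reflects the fact that some programs report scores where larger is better and others report expectation values where smaller is better.

```python
from dataclasses import dataclass


@dataclass
class PeptideMatch:
    """One peptide match; all field names are hypothetical."""
    scan: int
    sequence: str
    score: float            # search-engine score or expectation value
    retention_time: float   # observed retention time
    hydrophobicity: float   # calculated hydrophobicity H


def select_training_set(matches, threshold, higher_is_better=True):
    """Return the matches whose search score passes the threshold."""
    if higher_is_better:
        return [m for m in matches if m.score >= threshold]
    return [m for m in matches if m.score <= threshold]
```

A stricter threshold yields a cleaner but smaller training set, which is the trade-off discussed in the paragraph above.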

Recursive outlier-removal algorithm
The LR model is vulnerable to outliers, which distort the LR model training and lead to inaccurate results. As a result, it is crucial to remove as many outlier peptide matches as possible before model training. An algorithm based on robust LR was used to remove those outliers before model training. The algorithm required an iterative solution that is summarized as follows: 4. Repeat steps 1, 2 and 3 until no outliers are detected from the training data set in step 3 (Figure 1).
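The general idea of recursive outlier removal around a robust fit can be sketched in Python as follows. This is not the authors' exact procedure, since the individual steps are not reproduced in this text. The sketch assumes a Huber-weighted iteratively reweighted least-squares fit, a scale estimate based on the median absolute deviation, and a cutoff of three scale units for flagging outliers; the function names, the tuning constant `c`, and the `cutoff` value are all assumptions.

```python
import statistics


def weighted_fit(xs, ys, ws):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def mad_scale(residuals):
    """Robust scale estimate: median absolute deviation / 0.6745."""
    med = statistics.median(residuals)
    return statistics.median(abs(r - med) for r in residuals) / 0.6745


def robust_fit(xs, ys, c=1.345, iterations=20):
    """Huber-weighted IRLS, started from the ordinary least-squares fit."""
    a, b = weighted_fit(xs, ys, [1.0] * len(xs))
    for _ in range(iterations):
        res = [y - (a + b * x) for x, y in zip(xs, ys)]
        s = mad_scale(res) or 1e-12  # guard against a zero scale
        ws = [1.0 if abs(r) <= c * s else c * s / abs(r) for r in res]
        a, b = weighted_fit(xs, ys, ws)
    return a, b


def remove_outliers(xs, ys, cutoff=3.0):
    """Refit and drop points beyond cutoff * scale until none are flagged."""
    xs, ys = list(xs), list(ys)
    while True:
        a, b = robust_fit(xs, ys)
        res = [y - (a + b * x) for x, y in zip(xs, ys)]
        s = mad_scale(res) or 1e-12
        keep = [i for i, r in enumerate(res) if abs(r) <= cutoff * s]
        if len(keep) == len(xs):
            return xs, ys, (a, b)
        xs = [xs[i] for i in keep]
        ys = [ys[i] for i in keep]
```

Here `xs` would hold calculated hydrophobicities and `ys` observed retention times of the training matches. The loop ends when a pass flags no further points, which mirrors step 4 above.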

Score algorithm based on peptide retention time
After outliers are removed from the training data set, the LR model is fitted to the training data set by ordinary least squares to give estimates of the parameters (eqn. A.1). The predicted retention time, $\hat{T}$, for a peptide with the calculated hydrophobicity H is given by eqn. A.2. The prediction error, D, and the absolute error, Δ, of the predicted retention time for the peptide are respectively defined as

$$D = T - \hat{T} \qquad (4)$$

and

$$\Delta = |D| = |T - \hat{T}| \qquad (5)$$

The observed error of the prediction, δ, is calculated by

$$\delta = |T_{obs} - \hat{T}| \qquad (6)$$

where $T_{obs}$ is the observed retention time of the peptide. The score based on peptide retention time, $C_{RT}$, is defined as the probability that the theoretical absolute error Δ for a peptide match is greater than or equal to the observed absolute error δ, given that the peptide is a true match, i.e.

$$C_{RT} = P(\Delta \ge \delta \mid \text{peptide match is true}) \qquad (7)$$

Given the two assumptions that all true peptide matches follow the linear regression model in eqn. 1 and the linear regression model is a normal error regression model, $D / s_{\mathrm{pred}}$ follows a t distribution with n - 2 degrees of freedom for true peptides (see Appendix 1 for details). The score is calculated by the following equation.

$$C_{RT} = 2\left[1 - F_{t(n-2)}\left(\frac{\delta}{s_{\mathrm{pred}}}\right)\right] \qquad (8)$$

where $s_{\mathrm{pred}}$ is the standard error of the predicted retention time given in eqn. A.9, and $F_{t(n-2)}(x)$ is the cumulative distribution function of the t distribution with n - 2 degrees of freedom. A smaller δ gives a higher $C_{RT}$ score and indicates a higher confidence for the peptide match.
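The score defined above is a two-sided tail probability of a t distribution, which can be sketched in Python with the standard library alone. Two assumptions are made here: the standard error of prediction is taken to be the usual textbook expression for a new observation in simple linear regression (assumed, not confirmed, to correspond to eqn. A.9), and the t distribution is integrated numerically by Simpson's rule instead of calling a statistics library. All function names are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_tail(x, df):
    """P(|t| >= x), using Simpson's rule for the integral over [0, x]."""
    if x <= 0:
        return 1.0
    steps = 2 * max(1000, math.ceil(50 * x))  # even number of intervals
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def prediction_standard_error(h_new, hs, residuals):
    """Textbook standard error for predicting one new observation."""
    n = len(hs)
    h_mean = sum(hs) / n
    sxx = sum((h - h_mean) ** 2 for h in hs)
    mse = sum(r * r for r in residuals) / (n - 2)
    return math.sqrt(mse * (1 + 1 / n + (h_new - h_mean) ** 2 / sxx))


def retention_time_score(t_obs, t_pred, s_pred, n):
    """C_RT = 2 * (1 - F_t(n-2)(delta / s_pred)), delta = |t_obs - t_pred|."""
    delta = abs(t_obs - t_pred)
    return two_sided_tail(delta / s_pred, n - 2)
```

In this sketch `n` is the size of the training data set after outlier removal, `hs` holds the hydrophobicities of the training matches, and `residuals` holds their residuals from the fitted line. A match whose observed retention time equals the prediction scores 1, and the score falls toward 0 as the observed error grows.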
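The cumulative distribution function of any continuous random variable, evaluated at that variable, follows a continuous uniform distribution over the interval [0, 1]; this includes $F_{t(n-2)}(x)$ for -∞ < x < ∞. Because δ > 0, the distribution of $F_{t(n-2)}(\delta / s_{\mathrm{pred}})$, where $s_{\mathrm{pred}}$ is the standard error of the predicted retention time, is a continuous uniform distribution over the interval [0.5, 1]. Its probability density function is given in eqn. 9.

$$f_F(u) = \begin{cases} 2, & 0.5 \le u \le 1 \\ 0, & \text{otherwise} \end{cases} \qquad (9)$$

The probability density function of $C_{RT}$ for true peptide matches can be derived from that of $F_{t(n-2)}(\delta / s_{\mathrm{pred}})$, since $C_{RT} = 2[1 - F_{t(n-2)}(\delta / s_{\mathrm{pred}})]$:

$$f_{C_{RT}}(c) = f_F\left(1 - \frac{c}{2}\right)\left|\frac{d}{dc}\left(1 - \frac{c}{2}\right)\right| = \frac{1}{2} f_F\left(1 - \frac{c}{2}\right) \qquad (10)$$

Substitution of eqn. 9 into eqn. 10 yields the following.

$$f_{C_{RT}}(c) = \begin{cases} 1, & 0 \le c \le 1 \\ 0, & \text{otherwise} \end{cases} \qquad (11)$$

Thus, $C_{RT}$ follows a continuous uniform distribution over the interval [0, 1]. Consequently, its cumulative distribution function is given in eqn. 12.

$$F_{C_{RT}}(c) = \begin{cases} 0, & c < 0 \\ c, & 0 \le c \le 1 \\ 1, & c > 1 \end{cases} \qquad (12)$$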
The theoretical distribution of C RT for random peptide matches is unknown and varies from one search to another.

Automated evaluation of peptide matches based on their retention times
The retention time score algorithm is included as part of the MassMatrix database search program to perform automated evaluation of peptide matches. Due to the robustness of the algorithm, the score threshold for selection of training data does not significantly affect the model training and results. The score threshold for selection of training data for the algorithm in MassMatrix was set to be ≥ 8.0 for both pp and pp2 scores [31] and ≥ 2.0 for the pp tag score [36]. MassMatrix takes mzXML, mzData and MGF data files as input data formats. The retention time based algorithm automatically scores peptide matches if the input data file is either mzXML or mzData. The retention time based algorithm is not used if the input data file is an MGF file because MGF files lack retention times.

Post-hoc retention time scoring of other database search programs
The retention time based algorithm described herein is generic, and a post-hoc retention time analysis program for all other database search programs, LR_RT, was developed to perform automated post-search evaluation of the peptide matches. The post-hoc analysis requires the original mzData or mzXML file along with a tab or space delimited .txt file of the search results. The search result file must contain the scan number, peptide sequence and score information for each peptide match. The program was tested on Mascot and X!Tandem search results. Search results in the tab delimited .txt files can be obtained from Mascot html search results or X!Tandem pepXML search results by use of Perl scripts available at http://www.massmatrix.net. Score thresholds for selection of training data for the retention time based algorithm are set to be ≥ 30 and ≤ 0.1 for Mascot and X!Tandem results respectively.
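A reader for such a result file might look like the Python sketch below. The text states only that the file holds the scan number, peptide sequence and score for each match; the column order (scan, sequence, score), the optional header row, and the function name `read_search_results` are assumptions made for illustration and do not describe the actual LR_RT input specification.

```python
import io


def read_search_results(handle):
    """Parse lines of 'scan sequence score' separated by tabs or spaces."""
    matches = []
    for line in handle:
        fields = line.split()  # splits on any run of tabs or spaces
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        try:
            scan, score = int(fields[0]), float(fields[2])
        except ValueError:
            continue  # skip a header row or non-numeric fields
        matches.append((scan, fields[1], score))
    return matches
```

For example, `read_search_results(io.StringIO("101\tLVNELTEFAK\t45.2\n"))` yields one tuple holding the scan number, the sequence and the score.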

Evaluation of the robust linear regression based algorithm
MassMatrix automated evaluation of peptide matches based on their retention times
The robust LR based algorithm built in MassMatrix was evaluated against experimental LC-MS/MS data from bovine histone digests acquired by use of an LCQ Deca XP+ MS. The data set contained 3166 tandem mass spectra and was searched against a database that contained a bovine histone database and the NCBInr reversed human protein database as decoy sequences. The complete list of peptide matches is provided in the additional file [see Additional file 1]. The decoy database was much larger than the bovine histone database and created ~1000 times as many theoretical peptides as the bovine histone database. False positive peptide matches from the bovine histone database were thus assumed to be negligible [37,38]. As a result, peptide matches returned from the bovine histone database were considered as true positives (TPs) while those from the decoy database were considered to be false positives (FPs). Figure 2a shows the scatter plot of the original 254 peptide matches selected as training data. From the figure it is obvious that outlier peptide matches are present in the training data. The original training data set was fitted to the LR model without removal of any outliers. The coefficient of determination (R² value) for the LR model was 0.35, which indicated a poor correlation between peptide retention time and peptide hydrophobicity. Furthermore, the 99% confidence band from the linear model is too wide to be useful for scoring peptide matches. As expected, the training data set chosen by the database search program cannot be directly used to train the LR model due to outlier peptide matches included in the training data. Figure 2b shows a scatter plot of the training data set that contained 143 peptide matches after removal of outliers by the recursive outlier-removal algorithm described above. Outlier removal resulted in a strong linear relationship between the retention times and the hydrophobicities of peptide matches in the training data. The training data set was fitted to the LR model, and the R² value was improved from 0.35 to 0.90 after removal of the outliers. Furthermore, the 99% confidence band of the LR model was much narrower after the outlier removal by the algorithm.
The accuracy and robustness of the algorithm are illustrated in the scatter plot for all peptide matches from the search (Figure 3a). The solid and dashed lines in the figure represent the LR model and its 99% confidence band fitted to the training data set with a size of 143 after outlier removal. The key concern is that the application of the LR model should reduce false positives but not at the expense of true positives. In fact, 203 of 211 (96.21%) true peptide matches were observed within the 99% confidence band, i.e. with C_RT ≥ 0.01 (Figure 3b). In contrast, 279 of 715 (39.02%) false peptide matches were found within the 99% confidence band. Therefore, the majority (60.98%) of false peptide matches were filtered by the application of the LR model described herein while only 3.79% of true positive peptide matches were lost. The distributions of retention time based scores, C_RT, for true positive and false peptide matches are shown in Figure 4. The score distribution for true peptide matches was close to the expected theoretical distribution described by eqns. 11 and 12. False peptide matches had much lower scores than true peptide matches, and the majority of the false matches had scores less than 0.01.
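The percentages quoted above follow directly from the reported counts, as the short Python check below shows. Only the four counts come from the text; the helper name `percent` is introduced here for illustration.

```python
def percent(part, whole):
    """Percentage of part in whole, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


tp_total, tp_kept = 211, 203   # true matches: all, and with C_RT >= 0.01
fp_total, fp_kept = 715, 279   # false matches: all, and with C_RT >= 0.01

tp_retained = percent(tp_kept, tp_total)
tp_lost = percent(tp_total - tp_kept, tp_total)
fp_retained = percent(fp_kept, fp_total)
fp_filtered = percent(fp_total - fp_kept, fp_total)

print(tp_retained, tp_lost, fp_retained, fp_filtered)
```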
The algorithm was also evaluated by use of two publicly available LC-MS/MS data sets from significantly more complex samples. The first data set was created by use of LC-MS/MS on an LCQ ion trap mass spectrometer from a tryptic digest of a proteome sample from Deinococcus radiodurans MR-1 gram-positive bacteria. The data set (Dataset_021014.RAW for Deinococcus radiodurans data) along with the experimental details can be obtained at http://ncrr.pnl.gov/data/. The data set was searched by use of the MassMatrix database search program against a database that contained both the Deinococcus radiodurans database and a dominant reversed NCBInr human protein database used as a decoy database. The second data set was created by use of 2D-LC-MS/MS on an LCQ Deca XP+ mass spectrometer from the tryptic digest of a human proteome sample. The sample was separated by an SCX column in 11 salt steps and the fraction from each step was analyzed by C18 RPLC-MS/MS. Eleven MS/MS data sets were generated. The data set that was created from the first fraction, which contained the greatest number of MS/MS scans among all 11 data sets, was used in our evaluation. The data set and the experimental details can be found at http://bioinformatics.icmb.utexas.edu/OPD/ [39]. The data set was searched against a database with a target NCBInr human database and a dominant decoy database. The dominant decoy database contained ten randomized NCBInr human databases and one reversed human database. The search parameters for both data sets were the same as those used for the bovine histone data set.
Figure 2. Scatter plots of the training data set (a) before and (b) after removal of outliers by the recursive outlier-removal algorithm for the bovine histone data set.
The algorithm independently trained the peptide retention time prediction LR model for both the Deinococcus radiodurans data set and the human proteome data set. The R² values of the linear regression models for the two data sets were 0.90 and 0.93. As shown in Figure 5a, 53.24% of FPs were filtered at a threshold of 0.01 for C_RT with a loss of 5.47% of TPs for the Deinococcus radiodurans data set. For the human proteome data set, 40.02% of FPs were filtered with a loss of 0.31% of TPs. The results suggest that the algorithm implemented in MassMatrix can be effectively used to reduce false positives of database search results for LC-MS/MS proteomic data from complex samples.

Test of the assumptions of the algorithm
There are two assumptions involved in the algorithm. The first is that all true positives follow the linear regression model. It can be seen from the previous discussion that this assumption was violated. However, this departure from the first assumption was small and only caused small losses (0.31 to 5.47%) of TPs.
Figure 5. MassMatrix results for the data sets from Deinococcus radiodurans and the human proteome.
The second assumption involved in the algorithm is that the linear regression model in eqn. 1 is a normal error model, i.e., the residuals of the TPs that follow the linear regression model are normally distributed. This assumption was tested by the quantile-quantile plots (Q-Q plots) of all the residuals of the TPs following the linear regression model, i.e. with C_RT ≥ 0.01 for the three data sets. The three Q-Q plots for the three data sets are shown in Figure 6. It can be seen that the Q-Q plots for the bovine histone and Deinococcus radiodurans data sets formed a linear pattern and the normality assumption was valid (Figures 6a & 6b). For the human proteome data set, there was a slight departure from normality as shown in Figure 6c. However, the small departure from normality is not significant and does not create any serious concerns.
Overall these results demonstrate that there was no significant departure from normality of the residuals for TPs that follow the linear model.
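The coordinates of a normal Q-Q plot such as those referred to above can be computed with the Python standard library, as in the following sketch. The plotting positions (i + 0.5)/n and the use of the sample mean and standard deviation for the reference normal distribution are assumptions; the text does not state which convention was used for Figure 6. The function name `qq_points` is hypothetical.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Return (theoretical quantile, sample quantile) pairs."""
    ordered = sorted(residuals)
    n = len(ordered)
    normal = NormalDist(mean(ordered), stdev(ordered))
    return [(normal.inv_cdf((i + 0.5) / n), r)
            for i, r in enumerate(ordered)]
```

If the residuals are normally distributed, the returned points fall close to a straight line; systematic curvature at the ends would indicate heavy or light tails.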

Post-hoc retention time analysis program for other database search programs
The post-hoc retention time analysis program, LR_RT, was used to evaluate Mascot and X!Tandem search results for the bovine histone data set with the same database and search parameters as used with the MassMatrix search. Peptide matches from these two database search programs were evaluated by our retention time based algorithm, LR_RT. A retention time based score, C_RT, was calculated for each peptide match. The complete lists of peptide matches along with their C_RT scores are provided in the additional files [see Additional files 2 & 3]. The algorithm independently trained the peptide retention time prediction LR model for both the Mascot and X!Tandem search results. Similar to the case for MassMatrix, search results for Mascot and X!Tandem were significantly improved after retention time scoring. The final R² values from the LR models were 0.91 and 0.94 for Mascot and X!Tandem results respectively. More importantly, the LR_RT program was able to significantly reduce the number of false positives for both Mascot and X!Tandem search results. For Mascot search results, 66.85% of false positives were filtered with a threshold of 0.01 for C_RT, whereas only 2.48% of true positives were filtered with the same threshold (Figure 7a). For X!Tandem results, 65.47% of false positives were filtered with a loss of 3.50% of true positives (Figure 8a). Therefore, the majority of the false peptide matches from the two programs can be filtered by the retention time based score algorithm with a modest negative impact on the number of true positives.
Figure 6. The Q-Q plots of all the residuals of the true positives following the linear regression model from the MassMatrix search results for the bovine histone data set, the Deinococcus radiodurans data set, and the human proteome data set. Axes: sample quantiles and theoretical quantiles.
The post-hoc retention time analysis program was also evaluated with Mascot and X!Tandem search results from the Deinococcus radiodurans and human proteome data sets from complex samples. The databases and search parameters in Mascot and X!Tandem were the same as those used in the MassMatrix searches for the two data sets. For the Mascot searches of the Deinococcus radiodurans and human data sets (Figures 7b & 7c) and the X!Tandem search of the Deinococcus radiodurans data set (Figure 8b), the algorithm also effectively reduced false positives with small losses of true positives. However, the algorithm was not applicable to the X!Tandem search of the human data set. This was due to the fact that X!Tandem did not return a significant number of true positives for the data set. The number of peptide matches with expectation value ≤ 0.1 from the target database of the search was 14, which was not enough for peptide retention time model training. Figure 9 displays the receiver operating characteristic (ROC) curves for the MassMatrix, Mascot and X!Tandem search results of the three data sets before and after removal of peptide matches with C_RT < 0.01. For the human data set, the X!Tandem result is not shown because X!Tandem did not return a significant number of true positives. It can be seen that higher sensitivities and specificities were achieved after filtering peptide matches with insignificant C_RT scores for the searches of the bovine histone and Deinococcus radiodurans data sets in all three programs and the searches of the human proteome data set in MassMatrix and Mascot. Therefore, the false positive rates of search results in MassMatrix, Mascot and X!Tandem can be significantly lowered by including the new score algorithm based on peptide retention time.
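For readers who wish to reproduce this kind of ROC comparison, the points of a ROC curve can be computed from a list of scored matches labeled as true (target) or false (decoy), as in the Python sketch below. The function name `roc_points` and the input layout are hypothetical. The sketch assumes that a higher score is better, that both classes are present, and that tied scores may be ordered arbitrarily. How the denominators were handled after filtering is not stated in the text, so the sketch simply computes a curve for whichever list it is given.

```python
def roc_points(matches):
    """matches: iterable of (score, is_true) pairs.

    Returns (false positive rate, true positive rate) pairs obtained by
    lowering the score threshold one match at a time.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    positives = sum(1 for _, is_true in ranked if is_true)
    negatives = len(ranked) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_true in ranked:
        if is_true:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points
```

The true positive rate is the sensitivity and one minus the false positive rate is the specificity, so a curve that rises faster toward the upper left corner corresponds to the improvement described above.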

Figure 9. ROC analysis: ROC curves of MassMatrix, Mascot and X!Tandem search results before (dashed line) and after (solid line) filtering peptide matches with C_RT < 0.01 for the bovine histone data set, the Deinococcus radiodurans data set, and the human proteome data set.

Conclusion
An algorithm based on robust LR has been developed for automated evaluation of peptide matches from database searches by use of peptide retention time in reversed-phase HPLC. The recursive outlier-removal algorithm based on robust LR enables the algorithm to train the LR model on the fly for each search; thus the need for internal or external protein or peptide standards is eliminated. The LR model for peptide retention in RPLC developed by Krokhin et al [16] was adopted in the current implementation of the algorithm.
The algorithm was implemented in the MassMatrix database search program and evaluated with an LC-MS/MS data set of bovine histones obtained on an LCQ Deca XP mass spectrometer. The R² value for the LR model was improved from 0.35 to 0.90 after outlier removal. The majority (96.21%) of true peptide matches fell within the 99% confidence band for the trained LR model, whereas only 39.02% of false peptide matches fell in the same 99% confidence band. By use of this approach the majority (60.98%) of the false peptide matches can be filtered from the results based on retention time while only losing 3.79% of the true positive peptide matches.
A post-hoc retention time analysis program, LR_RT, was also developed to analyze peptide matches from other database search programs. The program was tested on Mascot and X!Tandem search results for the bovine histone data set. More than 60% of false positives in Mascot and X!Tandem search results were filtered by the program with a loss of no more than 3.5% of true positives.
The algorithm was also tested on two publicly available data sets from complex samples. For the data set from a Deinococcus radiodurans proteome sample, the algorithm was able to remove the majority of false positives at a small loss of true positives for searches in MassMatrix, Mascot and X!Tandem. For the data set from a human proteome sample, the algorithm could still effectively reduce false positive rates for searches in MassMatrix and Mascot. For the search of that data set in X!Tandem, the algorithm was not applicable because X!Tandem did not return a significant number of true positives.
A statistical score algorithm was developed for ranking peptide matches based on predicted and observed retention times. The score distribution for true peptide matches was close to its theoretical distribution, which indicates that the LR model trained by the robust LR based algorithm represents the true linear relationship between the peptide retention times in RPLC and their calculated hydrophobicities. False peptide matches tend to have much lower scores than true matches, and the majority of the false matches have scores less than 0.01. This score enables differentiation between true and false matches based on retention time. After removal of peptide matches with insignificant scores based on retention time, higher sensitivities and specificities were achieved and the false positive rates of the searches were significantly lowered, as shown by the ROC analysis for all three database search programs.

Availability and requirements
Project name: MassMatrix Retention Time Analysis.
Operating systems: Windows, Linux.
Other requirements: None.
License: None.

Due to the normality assumption of the residuals, the prediction error follows a normal distribution. Therefore, the prediction error divided by its estimated standard error follows a t distribution with (n - 2) degrees of freedom.

Robust Linear Regression
Ordinary least-squares estimates of LR models can be severely affected by outliers, which may lead to incorrect results. Another approach, called robust LR, is less vulnerable to outliers. A commonly used solution of a robust LR model involves an iterative method, which is summarized as follows [40]:
1. The ordinary least-squares estimates of the LR model from eqn. A.1 are obtained as initial estimates of the regression parameters.