Improvement of peptide identification with considering the abundance of mRNA and peptide
© The Author(s). 2017
Received: 7 July 2016
Accepted: 20 January 2017
Published: 16 February 2017
Tandem mass spectrometry (MS/MS) followed by database searching is the main approach for identifying peptides and proteins in proteomic studies. Considerable effort has been devoted to improving the accuracy and sensitivity of peptide/protein identification, for example by developing advanced algorithms and expanding protein databases.
Herein, we describe a new strategy for enhancing the sensitivity of protein/peptide identification by combining mRNA and peptide abundance in Percolator. In this strategy, a new workflow for peptide identification is built on the abundance of transcripts and potential novel transcripts derived from RNA-Seq, together with the abundance of peptides from the same species. We demonstrate the utility of this strategy on two MS/MS datasets. The results indicate that peptide identification at 1% FDR at the peptide level can be improved by about 5% to 8% by integrating peptide abundance, transcript abundance and potential novel transcripts from RNA-Seq data. In addition, 181 and 154 novel peptides were identified in the two datasets, respectively.
We have demonstrated that this strategy improves peptide/protein identification and enables the discovery of novel peptides, compared with traditional search methods.
Keywords: Bioinformatics, Mass spectrometry, RNA-Seq, Machine learning, Shotgun proteomics, Proteogenomics
Mass spectrometry (MS)-based methods have become the main means of identifying peptides and proteins in proteomics studies. Generally, the MS/MS data acquired from the mass spectrometer are analyzed with software that searches them against protein sequence databases for protein identification. Several such programs are available, either commercially or freely, such as SEQUEST, MASCOT, X!Tandem, OMSSA, MyriMatch and MS-GF+. The algorithms in these programs generally aim to improve the scores that evaluate the quality of a peptide spectrum match (PSM), i.e. the agreement between the experimental and the theoretical spectra. In general, the better the two match, the higher the score. The top-ranked PSM is not necessarily correct, however, owing to flaws in the scoring algorithm or the poor quality of the MS/MS spectrum. Hence, a target-decoy search strategy is used to estimate the false discovery rate (FDR) of the accepted matches. Although sophisticated algorithms for annotating mass spectra have developed dramatically, the identification rate of peptides/proteins from MS/MS data is still unsatisfactory, because poor identification is related to many factors, such as low efficiency of peptide ionization, low-quality or noisy MS/MS spectra, the dynamic range of protein abundances, the complexity of protein samples and flaws in the scoring algorithms.
Two categories of methods have been developed to improve the sensitivity of peptide/protein identification from MS/MS data. The first is post-processing algorithms designed to validate and filter PSMs based on the search engine’s results, such as PeptideProphet, which is derived from empirical modeling, Percolator, which is based on semi-supervised learning, and IPeak, which combines multiple search engines. These algorithms usually incorporate additional information from the MS/MS experiments for rescoring PSMs, such as peptide chromatographic retention time, peptide charge state, or mass accuracy. The second is algorithms that utilize external information, i.e. information gained from non-MS/MS-based experiments, such as RNA-Seq data [10–12]. Recently, Wang et al. described an approach that uses mRNA abundance to limit the size of protein sequence databases so as to improve the sensitivity of protein identification. Meanwhile, Shanmugam et al. proposed a method that uses RNA-Seq and GPMDB protein observation frequency to rescore or adjust protein identification probabilities so as to increase identification sensitivity, although its application was restricted to the protein level rather than the peptide or PSM level. Wu et al. also described a novel bioinformatics workflow focused on identifying new peptides that are absent from the standard protein databases but present in datasets derived from RNA-Seq data. This workflow, however, does not use the abundance information from the RNA-Seq data to assist peptide identification. Although much effort has been devoted to the two categories, there is a lack of methods that combine the advantages of both.
In this work, we introduce a novel proteomic analysis workflow that integrates a post-processing algorithm with external information gained from RNA-Seq data. By incorporating the abundance of mRNAs and peptides for rescoring PSMs, together with potential novel transcript sequences, we demonstrate that the sensitivity of peptide/protein identification and the discovery of novel peptides are significantly improved in the new pipeline.
Two MS/MS datasets were used in this study: MS/MS data for the Jurkat cell line and for mouse liver tissue, generated by an LTQ Orbitrap Velos. The raw data were downloaded from the PeptideAtlas (http://www.peptideatlas.org/) and iProx (http://www.iprox.cn) data repositories with the identifiers PASS00215 and IPX00003601 (ftp://126.96.36.199/IPX00003600/IPX00003601/), respectively. The paired-end 200 bp RNA-Seq data for the Jurkat cell line, generated on an Illumina HiSeq 2000, were downloaded from NCBI’s Gene Expression Omnibus (GEO) repository with accession number GSM1104129. The paired-end 90 bp RNA-Seq data for mouse liver tissue, generated by Wu et al., were downloaded from the Short Read Archive under study accession number SRP033468.
RNA-Seq data processing
RNA-Seq data were analyzed following the protocol of Trapnell et al. For the Jurkat cell line, the sequence reads were mapped to the Ensembl human genome (release GRCh37.75) using Tophat (version 2.0.8). Transcriptome reconstruction and expression quantification were performed with Cufflinks (version 2.2.1). For mouse liver, the Tophat and Cufflinks parameters followed those suggested by Wu et al. Fragments Per Kilobase of transcript per Million mapped reads (FPKM) was used to estimate the abundance of each transcript. Briefly, the original paired-end sequencing data were input into Cufflinks, and transcript abundance was estimated with the optimized parameters in the program (the detailed scripts used to generate the FPKM values are presented in Additional file 1).
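For orientation, the FPKM unit can be illustrated with the textbook formula. This is only a sketch of the definition: Cufflinks estimates abundance with a statistical model and effective transcript lengths, so its reported values will generally differ from this naive calculation. The function name and the example numbers are hypothetical.

```python
def naive_fpkm(fragments: float, transcript_length_bp: float,
               total_mapped_fragments: float) -> float:
    """Textbook FPKM: fragments per kilobase of transcript per million mapped fragments.

    Not the Cufflinks estimator; shown only to make the unit concrete.
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical example: 500 fragments on a 2 kb transcript, 25 million mapped fragments.
example_fpkm = naive_fpkm(500, 2000, 25_000_000)
print(example_fpkm)
```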
The customized protein sequence database
After obtaining the mapping results from Tophat, CustomProDB (version 1.7.0) was used to construct a customized protein database. In the customized database, proteins whose corresponding FPKM was less than 0.1 were filtered out. Novel transcripts were constructed by Cufflinks and compared with the reference annotation using Cuffcompare (version 2.2.1), in which transcripts labeled j represent potentially novel isoforms (fragments) and those labeled u represent unknown, intergenic transcripts. The translated peptides with the longest frame were added to the customized database.
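The selection rules in this paragraph can be sketched as follows. This is an illustration under stated assumptions, not the CustomProDB implementation: the function names are hypothetical, and because the text does not specify how the "longest frame" was chosen, the sketch assumes the three forward reading frames and takes the longest stop-free stretch.

```python
from itertools import product

FPKM_CUTOFF = 0.1                 # threshold stated in the text
NOVEL_CLASS_CODES = {"j", "u"}    # Cuffcompare codes named in the text

# Standard genetic code, built from the conventional TCAG ordering.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(_BASES, repeat=3), _AMINO)}


def keep_reference_protein(fpkm: float) -> bool:
    """Keep a reference protein unless its transcript FPKM is below the cutoff."""
    return fpkm >= FPKM_CUTOFF


def is_novel_transcript(class_code: str) -> bool:
    """True for Cuffcompare class codes treated as potentially novel."""
    return class_code in NOVEL_CLASS_CODES


def longest_frame_peptide(cdna: str) -> str:
    """Longest stop-free translation over the three forward frames (an assumption)."""
    seq = cdna.upper()
    best = ""
    for frame in range(3):
        protein = "".join(
            CODON_TABLE.get(seq[i:i + 3], "X")   # "X" for codons with ambiguous bases
            for i in range(frame, len(seq) - 2, 3)
        )
        for segment in protein.split("*"):       # "*" marks stop codons
            if len(segment) > len(best):
                best = segment
    return best


print(keep_reference_protein(0.05), is_novel_transcript("j"),
      longest_frame_peptide("ATGGCCAAGTGA"))
```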
Peptide search upon MS/MS data
The raw MS/MS data were converted into MGF and mzXML formats using msconvert in the ProteoWizard software package (v. 3.0.5047). Mascot (version 2.3.02) was used to search the MS/MS data against the Ensembl human proteome database (release GRCh37.75) and the customized database, respectively. Trypsin was specified as the enzyme with a maximum of two missed cleavages. For both datasets, the precursor mass tolerance was set to 10 ppm; the fragment ion mass tolerance was set to 0.05 Da for Jurkat and 0.5 Da for mouse liver. Carbamidomethylation of cysteine was set as a fixed modification, and oxidation of methionine was set as a variable modification. The automatic Mascot decoy database search was performed. The Mascot results were processed by MascotPercolator (v2.07) [8, 19]. The q-value threshold for identification was set to 1% at the PSM or peptide level.
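To make the 1% q-value threshold concrete, the following is a minimal sketch of a simple target-decoy estimate, in which the FDR at a score threshold is taken as the number of decoys divided by the number of targets above it. It is an assumption for illustration: Percolator and MascotPercolator compute q-values with their own procedure, and the function name and toy scores here are hypothetical.

```python
from typing import List, Tuple


def target_decoy_q_values(psms: List[Tuple[float, bool]]) -> List[Tuple[float, bool, float]]:
    """psms: (score, is_decoy) pairs. Returns (score, is_decoy, q_value), best score first."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))   # simple decoys/targets estimate
    # q-value: smallest FDR among this threshold and all more permissive ones.
    q_values = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()
    return [(s, d, q) for (s, d), q in zip(ranked, q_values)]


toy_psms = [(50.0, False), (45.0, False), (40.0, True), (35.0, False), (30.0, False)]
scored = target_decoy_q_values(toy_psms)
accepted = [s for s, is_decoy, q in scored if not is_decoy and q <= 0.01]
print(accepted)
```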
Peptide identification through integrating the abundance of peptides and transcripts
The abundance of each peptide, based on its extracted ion chromatogram (XIC), was estimated with a tool developed in-house (more details are described in Additional file 2). The transcript abundance was derived directly from the RNA-Seq data analysis. For a sequence in the protein database that matched a transcript, the transcript abundance was assigned directly as a feature for rescoring PSMs; for a sequence from the decoy database, a randomly selected transcript abundance was assigned; and for an un-transcribed sequence, the transcript abundance was set to zero. The two sets of quantitative features were passed to Percolator, an efficacious semi-supervised learning method for rescoring database search results. To avoid overfitting, Percolator randomly splits the PSMs into three subsets and trains three separate SVM classifiers, each trained on two of the three subsets and tested on the remaining subset. In addition, a total of 47 features derived from the MascotPercolator output were also used for rescoring PSMs (see Additional file 3: Figure S1). The detailed parameters and command line used for Percolator are presented in Additional file 1.
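The three assignment rules for the transcript-abundance feature can be sketched as below. This is a hedged illustration rather than the authors' code (which is in Additional file 1): the function name, the dictionary layout and the choice to draw decoy values from the observed FPKM values are all assumptions.

```python
import random
from typing import Dict


def transcript_abundance_feature(protein_id: str, is_decoy: bool,
                                 fpkm_by_protein: Dict[str, float],
                                 rng: random.Random) -> float:
    """Return the mRNA-abundance feature for one PSM.

    - decoy sequence: a randomly selected transcript abundance
    - target with a matching transcript: that transcript's abundance
    - un-transcribed target: zero
    """
    if is_decoy:
        # Assumption: sample uniformly from the observed abundances.
        return rng.choice(sorted(fpkm_by_protein.values()))
    return fpkm_by_protein.get(protein_id, 0.0)


# Hypothetical identifiers and values.
fpkm_table = {"PROT_A": 12.5, "PROT_B": 0.8}
rng = random.Random(0)   # seeded so the decoy draw is reproducible

print(transcript_abundance_feature("PROT_A", False, fpkm_table, rng))
print(transcript_abundance_feature("PROT_Z", False, fpkm_table, rng))
print(transcript_abundance_feature("DECOY_1", True, fpkm_table, rng))
```

Drawing decoy values at random keeps the feature from trivially separating targets from decoys, which would otherwise bias the FDR estimate.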
Results and Discussion
Construction of the customized database
Table 1 Summary of peptide identification with 1% FDR at the peptide level for different methods on the two data sets (Jurkat cell line and mouse liver). Methods compared: DBref + DBnovel; DBref + DBnovel + Rlow; DBref + DBnovel + Rlow + FmRNA; DBref + DBnovel + Rlow + Fpeptide; DBref + DBnovel + Rlow + Fpeptide+mRNA.
Improvement of peptide identification on account of transcript abundance
Improvement of peptide identification on account of MS1 XIC
MS1 XIC areas corresponding to peptide identification events were extracted from the corresponding RAW data files and treated as an indicator of peptide abundance. We considered the MS1 XIC as a feature for enhancing the peptide identification rate and supplied it to Percolator. Processing the same datasets with MS1 XIC as a feature identified 76259 peptides for the Jurkat cell line and 52170 for mouse liver. Compared with the data processed without MS1 XIC, the identification rate was improved by about 6.7% for the Jurkat cell line and 4.2% for mouse liver owing to the new feature. Approximately 94% of the peptides identified with and without MS1 XIC for the Jurkat cell line, and 97% for mouse liver, overlapped. Among the non-overlapping peptides, 4318 for the Jurkat cell line and 1698 for mouse liver were identified only after including MS1 XIC, whereas 342 and 521, respectively, were identified only without MS1 XIC. These results support our postulation that the MS1 XIC feature benefits peptide identification in Percolator processing.
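As a rough illustration of what an MS1 XIC area is, the sketch below sums the intensity near a target m/z in each MS1 scan and integrates the resulting trace over retention time with the trapezoidal rule. It is a generic sketch, not the in-house tool described in Additional file 2: the data layout, the function name and the 10 ppm extraction window are assumptions.

```python
from typing import List, Tuple

Scan = Tuple[float, List[Tuple[float, float]]]   # (retention time, [(m/z, intensity), ...])


def xic_area(scans: List[Scan], target_mz: float, ppm_tol: float = 10.0) -> float:
    """Trapezoidal area of the extracted ion chromatogram for target_mz.

    scans must be ordered by retention time.
    """
    tol = target_mz * ppm_tol / 1e6
    # Build the chromatogram: one summed intensity per scan.
    trace = [
        (rt, sum(inten for mz, inten in peaks if abs(mz - target_mz) <= tol))
        for rt, peaks in scans
    ]
    area = 0.0
    for (rt0, i0), (rt1, i1) in zip(trace, trace[1:]):
        area += 0.5 * (i0 + i1) * (rt1 - rt0)
    return area


# Hypothetical three-scan elution profile; the 600.0 m/z peak lies outside the window.
toy_scans = [
    (10.0, [(500.000, 0.0)]),
    (10.1, [(500.001, 100.0), (600.0, 999.0)]),
    (10.2, [(500.002, 0.0)]),
]
print(xic_area(toy_scans, target_mz=500.0))
```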
Further improvement of peptide identification with the combined features
To explore which of the two steps, building the customized database (step 1) or adding the two features to Percolator processing (step 2), is more important, we first performed step 1 alone. As shown in Table 1 (method: DBref + DBnovel + Rlow), 72283 peptides were identified on the Jurkat cell line dataset, 0.89% more than with the standard approach, which uses the reference proteins as the database (71645 peptides). On the mouse liver dataset, 50993 peptides were identified, 2.11% more than with the standard approach (49937 peptides). When both steps 1 and 2 were performed (Table 1, method: DBref + DBnovel + Rlow + Fpeptide+mRNA), 77682 peptides were identified on the Jurkat cell line dataset, 8.43% more than with the standard approach (71645 peptides) and 7.47% more than with step 1 alone. On the mouse liver dataset, 53024 peptides were identified, 6.18% more than with the standard approach (49937 peptides) and 3.98% more than with step 1 alone. These results indicate that step 2 contributes more than step 1.
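The percentages in this comparison are relative gains in the number of identified peptides, and they can be reproduced from the reported counts. The short script below is only a worked check of that arithmetic; the dictionary keys are labels chosen here for readability.

```python
def percent_gain(new_count: int, baseline_count: int) -> float:
    """Relative increase of new_count over baseline_count, in percent."""
    return 100.0 * (new_count - baseline_count) / baseline_count


# Peptide counts at 1% peptide-level FDR, as reported in the text.
counts = {
    "Jurkat cell line": {"standard": 71645, "step1": 72283, "step1+2": 77682},
    "mouse liver": {"standard": 49937, "step1": 50993, "step1+2": 53024},
}

for dataset, c in counts.items():
    print(
        dataset,
        round(percent_gain(c["step1"], c["standard"]), 2),     # step 1 vs standard
        round(percent_gain(c["step1+2"], c["standard"]), 2),   # steps 1+2 vs standard
        round(percent_gain(c["step1+2"], c["step1"]), 2),      # steps 1+2 vs step 1
    )
```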
Permutation test for the features taken for improvement of peptide identification
Using RNA-Seq data, including both its qualitative and quantitative information, is a promising strategy for improving the sensitivity of peptide identification and identifying novel peptides in MS/MS-based proteomic analysis. In this study, we described an approach that integrates a post-processing algorithm with RNA-Seq information to improve the sensitivity and accuracy of peptide identification. By incorporating transcript and peptide abundance as features for rescoring PSMs, we demonstrated that this approach can significantly improve the sensitivity of peptide identification and novel peptide detection.
FDR: False discovery rate
FPKM: Fragments Per Kilobase of transcript per Million mapped reads
PSM: Peptide spectrum match
XIC: Extracted ion current
This study was supported in part by the International Science & Technology Cooperation Program of China (2014DFB30020), the Chinese National Basic Research Programs (2014CBA02002, 2014CBA02005) and the National High-Tech Research and Development Program of China (2012AA020202). The funding body was not involved in the design of the study and collection, analysis, and interpretation of data or in writing the manuscript.
Availability of data and materials
The RNA-Seq and proteomics data sets of Jurkat cell line can be downloaded at NCBI’s Gene Expression Omnibus (GEO) repository and PeptideAtlas [RNA-Seq data: www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1104129; MS/MS data: https://db.systemsbiology.net/sbeams/cgi/PeptideAtlas/PASS_View?identifier=PASS00215]. The RNA-Seq and proteomics data sets of mouse liver can be downloaded at Short Read Archive and iProx [RNA-Seq data: www.ncbi.nlm.nih.gov/sra/SRX386467; MS/MS data: http://188.8.131.52/page/SDV015.html?subprojectId=IPX0000036001, FTP site for MS/MS data downloading: ftp://184.108.40.206/IPX00003600/IPX00003601/].
BW and SQL conceived and designed the project. CWM, SHX and GL performed the analysis. XX and XL participated in study design and project management. CWM, BW and SQL wrote the paper, and all authors revised and approved.
Does not apply.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Eng JK, McCormack AL, Yates JR III. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom. 1994;5(11):976–89.
- Perkins DN, Pappin DJ, Creasy DM, Cottrell JS. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999;20(18):3551–67.
- Fenyo D, Beavis RC. A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. Anal Chem. 2003;75(4):768–74.
- Geer LY, Markey SP, Kowalak JA, Wagner L, Xu M, Maynard DM, Yang X, Shi W, Bryant SH. Open mass spectrometry search algorithm. J Proteome Res. 2004;3(5):958–64.
- Tabb DL, Fernando CG, Chambers MC. MyriMatch: highly accurate tandem mass spectral peptide identification by multivariate hypergeometric analysis. J Proteome Res. 2007;6(2):654–61.
- Kim S, Pevzner PA. MS-GF+ makes progress towards a universal database search tool for proteomics. Nat Commun. 2014;5:5277.
- Keller A, Nesvizhskii AI, Kolker E, Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem. 2002;74(20):5383–92.
- Kall L, Canterbury JD, Weston J, Noble WS, MacCoss MJ. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat Methods. 2007;4(11):923–5.
- Wen B, Du C, Li G, Ghali F, Jones AR, Kall L, Xu S, Zhou R, Ren Z, Feng Q, et al. IPeak: An open source tool to combine results from multiple MS/MS search engines. Proteomics. 2015;15(17):2916–20.
- Wen B, Xu S, Sheynkman GM, Feng Q, Lin L, Wang Q, Xu X, Wang J, Liu S. sapFinder: an R/Bioconductor package for detection of variant peptides in shotgun proteomics experiments. Bioinformatics. 2014;30(21):3136–8.
- Wen B, Xu S, Zhou R, Zhang B, Wang X, Liu X, Xu X, Liu S. PGA: an R/Bioconductor package for identification of novel peptides using a customized database derived from RNA-Seq. BMC Bioinform. 2016;17(1):244.
- Li Y, Wang X, Cho JH, Shaw TI, Wu Z, Bai B, Wang H, Zhou S, Beach TG, Wu G, et al. JUMPg: An Integrative Proteogenomics Pipeline Identifying Unannotated Proteins in Human Brain and Cancer Cells. J Proteome Res. 2016;15(7):2309–20.
- Wang X, Slebos RJ, Wang D, Halvey PJ, Tabb DL, Liebler DC, Zhang B. Protein identification using customized protein sequence databases derived from RNA-Seq data. J Proteome Res. 2012;11(2):1009–17.
- Shanmugam AK, Yocum AK, Nesvizhskii AI. Utility of RNA-seq and GPMDB protein observation frequency for improving the sensitivity of protein identification by tandem MS. J Proteome Res. 2014;13(9):4113–9.
- Wu P, Zhang H, Lin W, Hao Y, Ren L, Zhang C, Li N, Wei H, Jiang Y, He F. Discovery of novel genes and gene isoforms by integrating transcriptomic and proteomic profiling from mouse liver. J Proteome Res. 2014;13(5):2409–19.
- Sheynkman GM, Shortreed MR, Frey BL, Smith LM. Discovery and mass spectrometric analysis of novel splice-junction peptides using RNA-Seq. Mol Cell Proteomics. 2013;12(8):2341–53.
- Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012;7(3):562–78.
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5(7):621–8.
- Brosch M, Yu L, Hubbard T, Choudhary J. Accurate and sensitive peptide identification with Mascot Percolator. J Proteome Res. 2009;8(6):3176–81.
- The M, MacCoss MJ, Noble WS, Käll L. Fast and Accurate Protein False Discovery Rates on Large-Scale Proteomics Data Sets with Percolator 3.0. J Am Soc Mass Spectrom. 2016;27(11):1719–27.
- Lee MV, Topper SE, Hubler SL, Hose J, Wenger CD, Coon JJ, Gasch AP. A dynamic model of proteome changes reveals new roles for transcript alteration in yeast. Mol Syst Biol. 2014;7(1):514–514.
- Nagaraj N, Wisniewski JR, Geiger T, Cox J, Kircher M, Kelso J, Paabo S, Mann M. Deep proteome and transcriptome mapping of a human cancer cell line. Mol Syst Biol. 2014;7(1):548–8.