PGA: an R/Bioconductor package for identification of novel peptides using a customized database derived from RNA-Seq
- Bo Wen†1,
- Shaohang Xu†1,
- Ruo Zhou1,
- Bing Zhang3,
- Xiaojing Wang3,
- Xin Liu1,
- Xun Xu1 and
- Siqi Liu1,2
© The Author(s). 2016
Received: 27 June 2015
Accepted: 9 June 2016
Published: 17 June 2016
Peptide identification based upon mass spectrometry (MS) is generally achieved by comparing the experimental mass spectra with peptides theoretically digested from a reference protein database. This strategy cannot identify peptide and protein sequences that are absent from the reference database. A customized protein database built from RNA-Seq data has therefore been proposed to assist with and improve the identification of novel peptides. Correspondingly, a comprehensive pipeline that provides an end-to-end solution for novel peptide detection with a customized protein database is needed.
We developed PGA, an R package that provides a pipeline for automated processing of tandem mass spectrometry (MS/MS) data acquired from different MS platforms and for construction of customized protein databases based on RNA-Seq data with or without a reference genome guide. PGA can identify novel peptides and generate an HTML-based report with a visualized interface. Applied to a published dataset, PGA identified 636 novel peptides, including 510 single amino acid polymorphism (SAP) peptides, 2 INDEL peptides, 49 splice junction peptides, and 75 novel transcript-derived peptides. The software is freely available from http://bioconductor.org/packages/PGA/, and example reports are available at http://wenbostar.github.io/PGA/.
PGA, designed to be platform-independent and easy to use, was shown to be capable of identifying novel peptides by searching a customized protein database derived from RNA-Seq data.
Keywords: Proteomics, RNA-Seq, MS/MS, Peptide identification, Proteogenomics
Using tandem mass spectrometry (MS/MS) data, database-dependent searching is a popular approach for peptide identification. The search relies on the completeness and quality of the reference proteome database. If the corresponding peptide sequence is not listed in the reference database, even a high-quality MS/MS spectrum will fail to yield a peptide identification. Generating a comprehensive reference database is therefore a challenging task in the bioinformatic analysis of MS/MS data. Common databases such as Ensembl, RefSeq, and UniProt cannot fully meet this requirement; however, several solutions have recently been proposed to improve the completeness of proteome databases. Databases have been constructed from six-frame translation of the genome and expressed sequence tags (ESTs), and by including known coding variations and alternative splicing events, in order to cover novel splice forms, genomic variants, and new genes. However, these methods significantly increase database size without greatly improving the sensitivity of peptide identification. Recent studies have reported advances in peptide and protein identification with the aid of transcriptome data obtained by high-throughput next-generation sequencing [8–12]. RNA-Seq technology provides qualitative and quantitative gene expression information on a whole-genome scale at single-base resolution. Since transcriptomic and proteomic analyses can be performed on the same cells or tissues, a sample-specific database based upon RNA-Seq data would significantly enhance sensitivity for peptide identification and improve accuracy for finding novel peptides.
Importantly, for non-model species that lack a genome sequence, transcript sequences derived from RNA-Seq data by de novo transcriptome assembly can be used to construct the proteomic database for MS/MS searching. In this strategy, the technical bottleneck is how to create an accessible and flexible bioinformatic pipeline that efficiently harnesses RNA-Seq data for the discovery of protein variations. To our knowledge, three recent software tools have made important contributions to this field: customProDB, an R package developed by Wang et al.; a workflow within Galaxy-P developed by Sheynkman et al.; and sapFinder, developed by Wen et al. However, customProDB only provides functions for database construction and does not offer functions for downstream analysis, such as database searching and post-processing, which are also very important for novel peptide identification. Galaxy-P provides functions for building SAP and splice databases; however, it does not include a function for novel transcript-coded peptides. sapFinder focuses mainly on peptides related to single amino acid polymorphisms rather than on general detection of novel peptides. Therefore, there is still much room for improving the identification of novel peptides through the construction of a comprehensive customized proteomic database based upon RNA-Seq data.
Herein, we describe PGA, an R/Bioconductor package that automates the construction of customized proteomic databases based upon RNA-Seq data with or without guidance from a reference genome, the searching of MS/MS data against these databases, post-processing, and the generation of an HTML-based report with a visualized interface.
Construction of the customized proteomic database
Two kinds of customized proteomic databases can be created with PGA. The first is constructed from the analysis of RNA-Seq data with a reference genome. In this case, the RNA-Seq data are analyzed by a series of software tools, such as the Genome Analysis Toolkit (GATK) or SAMtools, TopHat, and Cufflinks, to generate three inputs for the construction of a customized database: a Variant Call Format (VCF) file containing single nucleotide variants (SNVs) and INDELs generated by either GATK or SAMtools, a BED format file containing the junction information produced by TopHat, and a GTF format file containing novel transcripts reconstructed by Cufflinks. The second is constructed from the analysis of RNA-Seq data without a reference genome. In this case, the transcript sequences are assembled de novo using software such as Trinity. Note that only the data format matters for the construction of a customized database: files in these formats are accepted by PGA regardless of which software produced them. Constructing a database with guidance from a reference genome also requires genome annotation information, such as genome element region boundaries and protein coding sequences, which is downloaded from Ensembl or the University of California, Santa Cruz (UCSC) table browser using methods modified from customProDB. The functions for downloading this annotation information, and their usage, are described in the user's manual of the PGA package. For the VCF and BED format files, customProDB converts the RNA-Seq variants caused by SNVs, INDELs, and alternative splicing into the corresponding peptides. For the GTF format file, the new transcripts are converted to the corresponding peptides by three-frame translation when strand information is available, or by six-frame translation when it is not.
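The three-frame and six-frame translation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not PGA's implementation: the function names (`translate`, `reverse_complement`, `frame_translations`) are hypothetical, the standard genetic code is assumed, stop codons are emitted as `*`, and codons containing ambiguous bases are emitted as `X`.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate codon by codon from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def frame_translations(seq, strand=None):
    """Three-frame translation if strand is '+' or '-', six-frame otherwise."""
    seq = seq.upper()
    if strand == "+":
        templates = [seq]
    elif strand == "-":
        templates = [reverse_complement(seq)]
    else:
        # No strand information: translate both strands (six frames).
        templates = [seq, reverse_complement(seq)]
    return [translate(t[offset:]) for t in templates for offset in range(3)]
```

Calling `frame_translations(seq, strand="+")` returns three protein strings, while omitting `strand` returns six, with the forward-strand frames first.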
Optionally, the new transcripts can be converted to peptides based on the longest open reading frame (ORF) across all reading frames. The resulting customized proteomic database contains all the canonical proteins, the potential novel peptides derived from RNA-Seq data, and their corresponding reverse sequences. All the proteins and peptides are in FASTA format, and the FASTA headers for potential novel peptides are prefixed with "VAR" to distinguish them from the reference proteins. For the workflow without a reference genome, a FASTA format file containing the de novo assembled transcript sequences, produced by RNA-Seq analysis software such as Trinity rather than by PGA itself, is taken as input to PGA for proteomic database construction. For this kind of database construction, the annotation information from Ensembl or UCSC is not required, and the transcript sequences can be translated to protein sequences by three-frame or six-frame translation or based on the longest ORF across all reading frames.
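The database layout described above (canonical proteins, "VAR"-prefixed novel sequences, and reversed copies of both) can be illustrated with a short Python sketch. It is a hedged example, not PGA's code: the function names are hypothetical, the `|` separator after "VAR" and the `REV_` decoy prefix are assumptions made for illustration, and an ORF is taken here simply as the longest stop-free stretch of a translated frame (some tools additionally require a start codon).

```python
def longest_orf(frames):
    """Return the longest stop-free stretch across translated frames.

    `frames` is a list of protein strings in which '*' marks a stop codon.
    """
    segments = (seg for frame in frames for seg in frame.split("*"))
    return max(segments, key=len, default="")


def build_target_decoy_fasta(reference, novel, var_prefix="VAR|", decoy_prefix="REV_"):
    """Build FASTA text with target entries followed by reversed decoys.

    `reference` and `novel` map sequence names to protein sequences.
    The exact header layout is an assumption, not PGA's format.
    """
    targets = list(reference.items())
    # Mark RNA-Seq-derived sequences so they can be told apart later.
    targets += [(var_prefix + name, seq) for name, seq in novel.items()]
    # Decoys are the reversed sequences of every target entry.
    decoys = [(decoy_prefix + name, seq[::-1]) for name, seq in targets]
    return "".join(f">{name}\n{seq}\n" for name, seq in targets + decoys)
```

Keeping the novel and decoy markers in the header means downstream steps can classify each match by inspecting the protein name alone.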
MS/MS data searching
X!Tandem is a well-accepted, open-source search engine and is used as the default database searching method in PGA. In the PGA workflow, the R package rTANDEM, an R encapsulation of X!Tandem, is used automatically to search the MS/MS spectra against the customized proteomic database. It accepts several MS/MS data formats as input for database searching, such as DTA, PKL, or MGF. Alternatively, PGA also accepts search results in dat format from MASCOT or in mzIdentML format from MS-GF+, MyriMatch, OMSSA (after converting the OMSSA result to mzIdentML with mzidLibrary), and IPeak [28, 29].
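To make the MGF input format mentioned above concrete, the following Python sketch reads spectra from MGF text. It is an illustration only and unrelated to how X!Tandem or rTANDEM parse files: the function name `parse_mgf` is hypothetical, and the sketch assumes the common layout of `BEGIN IONS` / `END IONS` blocks containing `KEY=value` parameter lines followed by whitespace-separated m/z and intensity pairs.

```python
def parse_mgf(text):
    """Parse MGF-formatted text into a list of spectra.

    Each spectrum is a dict with 'params' (e.g. TITLE, PEPMASS, CHARGE,
    kept as raw strings) and 'peaks' (a list of (mz, intensity) tuples).
    """
    spectra = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Parameter line; split only on the first '='.
                key, value = line.split("=", 1)
                current["params"][key.upper()] = value
            else:
                fields = line.split()
                intensity = float(fields[1]) if len(fields) > 1 else 0.0
                current["peaks"].append((float(fields[0]), intensity))
    return spectra
```

Lines outside a `BEGIN IONS` / `END IONS` block, such as file-level parameters, are ignored in this sketch.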
Generation of the HTML-based report
Using the R package Nozzle, PGA outputs an HTML-based interactive report, which contains summary plots and tables, annotated spectra, and identification information for novel peptides and canonical peptides.
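As a rough illustration of what a summary table in such a report involves, the Python sketch below renders peptide counts as a static HTML page. It does not use or imitate Nozzle's API; the function name `render_summary_html` and the page layout are assumptions made for this example.

```python
from html import escape


def render_summary_html(title, counts):
    """Render a mapping of peptide class to count as a minimal HTML page."""
    rows = "\n".join(
        f"<tr><td>{escape(label)}</td><td>{count}</td></tr>"
        for label, count in counts.items()
    )
    total = sum(counts.values())
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        f"<title>{escape(title)}</title></head><body>\n"
        f"<h1>{escape(title)}</h1>\n"
        "<table>\n<tr><th>Peptide class</th><th>Count</th></tr>\n"
        f"{rows}\n"
        f"<tr><th>Total</th><th>{total}</th></tr>\n"
        "</table>\n</body></html>\n"
    )
```

For example, passing the novel peptide counts reported in the abstract (510 SAP, 2 INDEL, 49 splice junction, and 75 novel transcript-derived peptides) yields a table whose total row is 636.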
Results and discussion
PGA was evaluated using a published data set in which RNA-Seq and proteomic data were collected from the Jurkat cell line in parallel. The RNA-Seq data were downloaded from NCBI's Gene Expression Omnibus (GEO) repository with the accession number GSE45428, and the MS/MS data were downloaded from the PeptideAtlas repository with the accession number PASS00215. The detailed processing steps for the data are described in Additional file 1. Two workflows were evaluated. In the first, protein identification was based on the customized proteomic database derived from RNA-Seq data analysis with reference genome guidance. In the second, protein identification was based on the customized database derived from de novo transcriptome assembly of the RNA-Seq data by Trinity, without reference genome guidance.
Table: Identified transcripts and peptides at different numbers of input reads for Trinity
No. of reads | No. of transcripts (>200 bp) | No. of identified peptides (FDR ≤ 1 %)
Using RNA-Seq data to enhance MS analysis is a promising strategy for discovering novel peptides and improving the sensitivity of peptide identification. The main bottleneck for widespread application of this strategy is the lack of easy-to-use software. We provide an end-to-end solution to this problem by introducing a complete pipeline in the Bioconductor environment. The software was evaluated on a data set of RNA-Seq and proteomic data collected in parallel from a human cell line. Through construction of a customized proteomic database derived from RNA-Seq, PGA was demonstrated to be a feasible program for discovering novel peptides arising from genetic variation, alternative splice forms, and novel coding genes.
Availability and requirements
Project name: PGA software.
Project home page: http://bioconductor.org/packages/PGA/.
Operating system(s): Linux, Mac OSX, Windows.
Programming language: R, JAVA.
Other requirements: None.
Any restrictions to use by non-academics: GPL-2.
FDR, false discovery rate; GATK, the Genome Analysis Toolkit; MS/MS, tandem mass spectrometry; PSM, peptide-spectrum match; SAP, single amino acid polymorphism; SNV, single nucleotide variant; VCF, Variant Call Format
This study was supported by the International Science & Technology Cooperation Program of China (2014DFB30020), Chinese National Basic Research Programs (2014CBA02002-A, 2014CBA02005) and the National High-Tech Research and Development Program of China (2012AA020202). We thank Guangyi Fan, Liangwei Li and Rui Guan for RNA-Seq data analysis.
BW conceived of and designed the project. SHX and BW wrote the code. BZ and XJW provided some code. SHX, BW, and RZ tested the software. BW and SQL wrote the paper, and all authors revised and approved the manuscript.
The authors declare that they have no competing interests.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, et al. Ensembl 2013. Nucleic Acids Res. 2013;41(Database issue):D48–55.
- Pruitt KD, Tatusova T, Brown GR, Maglott DR. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 2012;40(Database issue):D130–135.
- UniProt Consortium. Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res. 2013;41(Database issue):D43–47.
- Fermin D, Allen BB, Blackwell TW, Menon R, Adamski M, Xu Y, Ulintz P, Omenn GS, States DJ. Novel gene and gene model detection using a whole genome open reading frame analysis in proteomics. Genome Biol. 2006;7(4):R35.
- Edwards NJ. Novel peptide identification from tandem mass spectra using ESTs and sequence database compression. Mol Syst Biol. 2007;3:102.
- Li J, Su Z, Ma ZQ, Slebos RJ, Halvey P, Tabb DL, Liebler DC, Pao W, Zhang B. A bioinformatics workflow for variant peptide detection in shotgun proteomics. Mol Cell Proteomics. 2011;10(5):M110.006536.
- Mo F, Hong X, Gao F, Du L, Wang J, Omenn GS, Lin B. A compatible exon-exon junction database for the identification of exon skipping events using tandem mass spectrum data. BMC Bioinformatics. 2008;9:537.
- Wang X, Slebos RJ, Wang D, Halvey PJ, Tabb DL, Liebler DC, Zhang B. Protein identification using customized protein sequence databases derived from RNA-Seq data. J Proteome Res. 2012;11(2):1009–17.
- Wen B, Xu S, Sheynkman GM, Feng Q, Lin L, Wang Q, Xu X, Wang J, Liu S. sapFinder: an R/Bioconductor package for detection of variant peptides in shotgun proteomics experiments. Bioinformatics. 2014;30(21):3136–8.
- Wu P, Zhang H, Lin W, Hao Y, Ren L, Zhang C, Li N, Wei H, Jiang Y, He F. Discovery of novel genes and gene isoforms by integrating transcriptomic and proteomic profiling from mouse liver. J Proteome Res. 2014;13(5):2409–19.
- Tay AP, Pang CN, Twine NA, Hart-Smith G, Harkness L, Kassem M, Wilkins MR. Proteomic Validation of Transcript Isoforms, Including Those Assembled from RNA-Seq Data. J Proteome Res. 2015;14(9):3541–54.
- Evans VC, Barker G, Heesom KJ, Fan J, Bessant C, Matthews DA. De novo derivation of proteomes from transcriptomes for transcript and protein identification. Nat Methods. 2012;9(12):1207–11.
- Sheynkman GM, Johnson JE, Jagtap PD, Shortreed MR, Onsongo G, Frey BL, Griffin TJ, Smith LM. Using Galaxy-P to leverage RNA-Seq for the discovery of novel protein variations. BMC Genomics. 2014;15:703.
- Wang X, Zhang B. customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search. Bioinformatics. 2013;29(24):3235–7.
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297–303.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.
- Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25(9):1105–11.
- Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28(5):511–5.
- Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29(7):644–52.
- Craig R, Beavis RC. TANDEM: matching proteins with tandem mass spectra. Bioinformatics. 2004;20(9):1466–7.
- Fournier F, Joly Beauparlant C, Paradis R, Droit A. rTANDEM, an R/Bioconductor package for MS/MS protein identification. Bioinformatics. 2014;30(15):2233–4.
- Perkins DN, Pappin DJ, Creasy DM, Cottrell JS. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999;20(18):3551–67.
- Jones AR, Eisenacher M, Mayer G, Kohlbacher O, Siepen J, Hubbard SJ, Selley JN, Searle BC, Shofstahl J, Seymour SL, et al. The mzIdentML data standard for mass spectrometry-based proteomics results. Mol Cell Proteomics. 2012;11(7):M111.014381.
- Kim S, Pevzner PA. MS-GF+ makes progress towards a universal database search tool for proteomics. Nat Commun. 2014;5:5277.
- Tabb DL, Fernando CG, Chambers MC. MyriMatch: highly accurate tandem mass spectral peptide identification by multivariate hypergeometric analysis. J Proteome Res. 2007;6(2):654–61.
- Geer LY, Markey SP, Kowalak JA, Wagner L, Xu M, Maynard DM, Yang X, Shi W, Bryant SH. Open mass spectrometry search algorithm. J Proteome Res. 2004;3(5):958–64.
- Ghali F, Krishna R, Lukasse P, Martinez-Bartolome S, Reisinger F, Hermjakob H, Vizcaino JA, Jones AR. Tools (Viewer, Library and Validator) that facilitate use of the peptide and protein identification standard format, termed mzIdentML. Mol Cell Proteomics. 2013;12(11):3026–35.
- Wen B, Du C, Li G, Ghali F, Jones AR, Kall L, Xu S, Zhou R, Ren Z, Feng Q, et al. IPeak: An open source tool to combine results from multiple MS/MS search engines. Proteomics. 2015;15(17):2916–20.
- Wen B, Li G, Wright JC, Du C, Feng Q, Xu X, Choudhary JS, Wang J. The OMSSAPercolator: an automated tool to validate OMSSA results. Proteomics. 2014;14(9):1011–4.
- Muth T, Vaudel M, Barsnes H, Martens L, Sickmann A. XTandem Parser: an open-source library to parse and analyse X!Tandem MS/MS search results. Proteomics. 2010;10(7):1522–4.
- Helsens K, Martens L, Vandekerckhove J, Gevaert K. MascotDatfile: an open-source library to fully parse and analyse MASCOT MS/MS search results. Proteomics. 2007;7(3):364–6.
- Reisinger F, Krishna R, Ghali F, Rios D, Hermjakob H, Vizcaino JA, Jones AR. jmzIdentML API: A Java interface to the mzIdentML standard for peptide and protein identification data. Proteomics. 2012;12(6):790–4.
- Karpova MA, Karpov DS, Ivanov MV, Pyatnitskiy MA, Chernobrovkin AL, Lobas AA, Lisitsa AV, Archakov AI, Gorshkov MV, Moshkovskii SA. Exome-driven characterization of the cancer cell lines at the proteome level: the NCI-60 case study. J Proteome Res. 2014;13(12):5551–60.
- Nesvizhskii AI, Keller A, Kolker E, Aebersold R. A statistical model for identifying proteins by tandem mass spectrometry. Anal Chem. 2003;75(17):4646–58.
- Gehlenborg N, Noble MS, Getz G, Chin L, Park PJ. Nozzle: a report generation toolkit for data analysis pipelines. Bioinformatics. 2013;29(8):1089–91.
- Sheynkman GM, Shortreed MR, Frey BL, Scalf M, Smith LM. Large-scale mass spectrometric detection of variant peptides resulting from nonsynonymous nucleotide differences. J Proteome Res. 2014;13(1):228–40.
- Desiere F, Deutsch EW, King NL, Nesvizhskii AI, Mallick P, Eng J, Chen S, Eddes J, Loevenich SN, Aebersold R. The peptideatlas project. Nucleic Acids Res. 2006;34 suppl 1:D655–8.
- Blakeley P, Overton IM, Hubbard SJ. Addressing statistical biases in nucleotide-derived protein databases for proteogenomic search strategies. J Proteome Res. 2012;11(11):5221–34.
- Jagtap P, Goslinga J, Kooren JA, McGowan T, Wroblewski MS, Seymour SL, Griffin TJ. A two-step database search method improves sensitivity in peptide sequence matches for metaproteomics and proteogenomics studies. Proteomics. 2013;13(8):1352–7.