The Proteogenomic Mapping Tool
© Sanders et al; licensee BioMed Central Ltd. 2011
Received: 23 June 2010
Accepted: 22 April 2011
Published: 22 April 2011
High-throughput mass spectrometry (MS) proteomics data is increasingly being used to complement traditional structural genome annotation methods. To keep pace with the high speed of experimental data generation and to aid in structural genome annotation, experimentally observed peptides need to be mapped back to their source genome location quickly and exactly. Previously, the tools to do this have been limited to custom scripts that individual research groups designed to analyze their own data; these scripts are generally not widely available and do not scale well with large eukaryotic genomes.
The Proteogenomic Mapping Tool includes a Java implementation of the Aho-Corasick string searching algorithm which takes as input standardized file types and rapidly searches experimentally observed peptides against a given genome translated in all 6 reading frames for exact matches. The Java implementation allows the application to scale well with larger eukaryotic genomes while providing cross-platform functionality.
The Proteogenomic Mapping Tool provides a standalone application for mapping peptides back to their source genome on a number of operating system platforms with standard desktop computer hardware and executes very rapidly for a variety of datasets. Because users can select different genetic codes for different organisms, researchers can easily customize the tool to their own research interests. The tool is recommended for anyone working to structurally annotate genomes using MS-derived proteomics data.
Expressed proteins provide experimental evidence that genes in the genome are being transcribed and translated to produce a protein product. Recently, a new structural genome annotation method, proteogenomic mapping, has been developed that uses identified peptides from experimentally derived proteomics data to identify functional elements in genomes and to improve genome annotation [1, 2]. Initially used for the structural annotation of prokaryotic genomes, proteogenomic mapping is rapidly gaining traction in eukaryotic genome annotation projects with larger genomes as a complementary method [3, 4].
Proteogenomic mapping can identify potential new genes or corrections to the boundaries of predicted genes. Peptides that match the genome but do not match the predicted proteome are used to generate expressed Protein Sequence Tags (ePSTs). When aligned with the genome and combined with the published structural annotation, these ePSTs are indicative of translation throughout the genome and can serve to supplement traditional structural genome annotation methods [3–5].
While a number of research groups are becoming increasingly active in the field of proteogenomic mapping [1–5], there is a lack of published and standardized tools to rapidly and exactly map identified peptides back to the genome translated in all 6 reading frames. To our knowledge, there is only one comparable tool, PepLine, which utilizes a de novo based spectral identification methodology. In contrast, our tool is implemented to work with the output from LC MS/MS combined with database search based spectral identification algorithms. PepLine uses peptide sequence tags (PSTs), short spectral match translations of 3-4 amino acids with flanking matches on either end, for searches against the genome, whereas our tool works with peptides derived from MS/MS database searches. While PepLine's use of PSTs allows the direct searching of spectra against the genome, a staged search method that maps peptides already identified by database searches is an alternative.
The Proteogenomic Mapping Pipeline is free to obtain and use, written completely in Java, and available for all common computer platforms. It is licensed under the GNU GPLv3 license, which makes it completely open source and makes the source code and implementation methodology available to the end user. We have endeavored to make this tool as easy to use as possible and have provided both a command line version and a graphical user interface (GUI) for all common platforms.
Data Input and Customization
To generate the FASTA file of the peptides to be searched, it is expected that the user will have performed spectral matching for their MS dataset of interest against databases generated from both the proteome and the genome translated in all six reading frames, and will have confirmed these peptide identifications using a peptide validation strategy. After validation, the unique peptide identifications resulting from a database search against the genome that are not contained among the proteome peptide identifications should be used as the list of peptides to be searched.
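The peptide selection step described above amounts to a set difference followed by FASTA formatting. The following is a minimal Python sketch of that idea, not part of the tool itself; the function names and the `pep` header prefix are hypothetical, and it assumes the validated peptide sequences from each search are already available as plain lists of strings.

```python
def genome_only_peptides(genome_hits, proteome_hits):
    """Return unique peptides identified against the six-frame genome
    database that were NOT identified against the proteome database."""
    return sorted(set(genome_hits) - set(proteome_hits))


def to_fasta(peptides, prefix="pep"):
    """Format peptides as FASTA text with sequential, made-up identifiers."""
    return "".join(
        f">{prefix}{i}\n{seq}\n" for i, seq in enumerate(peptides, start=1)
    )


if __name__ == "__main__":
    # Toy sequences for illustration only, not real identifications.
    genome = ["AAK", "LLR", "AAK", "MMK"]
    proteome = ["LLR"]
    print(to_fasta(genome_only_peptides(genome, proteome)), end="")
```

Sorting the result is only a convenience that makes the output reproducible; the order of peptides does not matter for the search.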
The command line version of the Proteogenomic Mapping Pipeline allows the same inputs as the GUI to be specified as command line arguments and can be run on standard computer platforms (Windows, Linux, Unix, MacOS). An example of using the command line version of the program is included in the README file provided with the application.
The application translates the nucleotide database to protein in all 6 reading frames using the genetic code selected by the user. We provide the most common genetic codes from NCBI, which are represented in NCBI's standard format for genetic codes in the genetic_code_table file included with the application. The application then maps the peptides to the translated genome using the Aho-Corasick string searching algorithm to provide rapid and exact matches of peptides to the genome [10, 11]. The Aho-Corasick string matching algorithm quickly locates all occurrences of keywords within a text string. The algorithm consists primarily of two phases. In the first, a finite state machine is constructed from the set of keywords. The time to construct this machine and its memory requirements are linearly proportional to the sum of the lengths of the keywords. The second phase consists of running the state machine using the text string as input. This phase takes time linearly proportional to the length of the text string. Thus, the time to run the entire algorithm is proportional to the sum of the lengths of the keywords and the length of the text string. In our case, the peptides for which to search are the keywords, and the translated reference genome against which to search is the text string.
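The two phases described above can be illustrated with a compact Python sketch. This is a textbook-style rendering of Aho-Corasick written for this explanation, not the tool's Java source; the names `build_automaton` and `search` are hypothetical, and the automaton is stored as simple lists and dictionaries for readability rather than speed.

```python
from collections import deque


def build_automaton(keywords):
    """Phase 1: build the goto, failure, and output tables from the keywords."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # keywords recognized on reaching each state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first traversal so that shallower states are finished first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the failure state (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, automaton):
    """Phase 2: run the text through the automaton once.

    Returns (0-based start index, keyword) pairs for every occurrence.
    """
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits


if __name__ == "__main__":
    automaton = build_automaton(["HE", "SHE", "HIS", "HERS"])
    print(search("USHERS", automaton))
```

In the proteogenomic setting, the keyword list would hold the peptides and `text` would be one translated reading frame, so a single pass over each frame reports every peptide match, including overlapping ones.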
Output File Description
Three output files are produced by the application. The first file is a FASTA file containing the ePSTs generated for the dataset. The second file is a more detailed tab separated text file containing the original peptide's identification, the peptide sequence, the FASTA header for the nucleotide sequence containing the match, the mapping start and end locations for the reverse translated peptide, the strand of the nucleotide match, the reading frame of the match, the reverse translated peptide sequence, a longer nucleotide sequence extending from the 5' in-frame stop codon immediately upstream of the peptide to the 3' in-frame stop codon immediately downstream of the peptide, the ePST nucleotide sequence and the start and stop locations of the ePST on the nucleotide sequence, the length of the ePST, and the translated ePST. The third file is a GFF3 file containing the ePSTs generated for the dataset to provide researchers with a file format they can quickly load into genome browsers for data visualization.
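The mapping start and end locations, strand, and reading frame reported in the detailed output follow from simple arithmetic on the position of a match within a translated frame. The Python sketch below illustrates that arithmetic under stated assumptions: it hard-codes the standard genetic code (NCBI translation table 1), numbers frames 1 to 3 on each strand, reports 1-based inclusive coordinates on the forward strand, and uses plain `str.find` in place of Aho-Corasick to keep the example short. The conventions used by the application itself may differ, and all function names are hypothetical.

```python
BASES = "TCAG"
# Amino acids for the standard genetic code, with codons ordered
# TTT, TTC, TTA, TTG, TCT, ... (the ordering used in NCBI's tables).
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AAS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, offset):
    """Translate one reading frame; codons with ambiguous bases become 'X'."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def map_to_genome(aa_start, aa_len, offset, strand, genome_len):
    """Convert a 0-based match in a translated frame to 1-based inclusive
    nucleotide coordinates on the forward strand."""
    nt_start = offset + 3 * aa_start   # 0-based, on the translated strand
    nt_end = nt_start + 3 * aa_len     # exclusive
    if strand == "+":
        return nt_start + 1, nt_end
    return genome_len - nt_end + 1, genome_len - nt_start


def locate(peptide, dna):
    """Find a peptide in all six frames; return (start, end, strand, frame)."""
    dna = dna.upper()
    hits = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            protein = translate(seq, offset)
            pos = protein.find(peptide)
            while pos != -1:
                start, end = map_to_genome(
                    pos, len(peptide), offset, strand, len(dna))
                hits.append((start, end, strand, offset + 1))
                pos = protein.find(peptide, pos + 1)
    return hits


if __name__ == "__main__":
    # Toy sequence: the codons ATG AAA GTT (M, K, V) start at nucleotide 3.
    print(locate("MKV", "CCATGAAAGTTTAGC"))
```

For a match on the reverse strand, the coordinates are reflected about the sequence length so that start is always less than or equal to end on the forward strand, which is the orientation GFF3 expects for its start and end columns.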
[Table: Example Dataset Statistics. Columns: # unique peptides; # unique peptides mapping exclusively to genome.]
Results and Discussion
[Table: Channel Catfish Virus Peptides and ePSTs.]
[Table: Runtime Analysis For Example Datasets. Columns include: # unique peptides mapping.]
Possible future updates to this application include parallelizing the searches against the genome across all 6 reading frames and improving thread support to take better advantage of multi-core processors.
The Proteogenomic Mapping Pipeline provides a standalone tool that facilitates streamlined mapping of peptides to a target genome for structural genome annotation through the use of proteomics. This software can be used on a variety of current operating systems, and its ability to use a variety of genetic codes makes it easily customizable for researchers performing proteogenomic mapping in a variety of prokaryotes, eukaryotes, and viruses.
Availability and requirements
This research was funded in part by NIH grant 1R24GM079326-01A1, USDA grant 20053560017688 07010072, and NSF EPSCoR grant EPS-0903787. We also thank Dusan Kunec for providing us access to the channel catfish virus proteomics data.
- Jaffe JD, Berg HC, Church GM: Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics 2004, 4: 59–77. 10.1002/pmic.200300511
- McCarthy FM, Cooksey AM, Wang N, Bridges SM, Pharr GT, Burgess SC: Modeling a whole organ using proteomics: the avian bursa of Fabricius. Proteomics 2006, 6: 2759–2771. 10.1002/pmic.200500648
- Castellana NE, Payne SH, Shen Z, Stanke M, Bafna V, Briggs SP: Discovery and revision of Arabidopsis genes by proteogenomics. Proc Natl Acad Sci USA 2008, 105: 21034–21038. 10.1073/pnas.0811066106
- Sevinsky JR, Cargile BJ, Bunger MK, Meng F, Yates NA, Hendrickson RC, Stephenson JL Jr: Whole genome searching with shotgun proteomic data: applications for genome annotation. J Proteome Res 2008, 7: 80–88. 10.1021/pr070198n
- Kunec D, Nanduri B, Burgess SC: Experimental annotation of channel catfish virus by probabilistic proteogenomic mapping. Proteomics 2009, 9: 2634–2647. 10.1002/pmic.200800397
- Ferro M, Tardif M, Reguer E, Cahuzac R, Bruley C, Vermat T, Nugues E, Vigouroux M, Vandenbrouck Y, Garin J, Viari A: PepLine: a software pipeline for high-throughput direct mapping of tandem mass spectrometry data on genomic sequences. J Proteome Res 2008, 7: 1873–1883. 10.1021/pr070415k
- The GNU General Public License version 3 [http://www.gnu.org/copyleft/gpl.html]
- NCBI Genetic Code Table [ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt]
- Pertea M, Lin X, Salzberg SL: GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Res 2001, 29: 1185–1190. 10.1093/nar/29.5.1185
- Aho AV, Corasick MJ: Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM 1975, 18: 333–340. 10.1145/360825.360855
- Dandass YS, Burgess SC, Lawrence M, Bridges SM: Accelerating string set matching in FPGA hardware for bioinformatics research. BMC Bioinformatics 2008, 9: 197. 10.1186/1471-2105-9-197
- Wu Q, Krainer AR: AT-AC pre-mRNA splicing mechanisms and conservation of minor introns in voltage-gated ion channel genes. Mol Cell Biol 1999, 19: 3225–3236.
- Nanduri B, Wang N, Lawrence ML, Bridges SM, Burgess SC: Gene model detection using mass spectrometry. Methods Mol Biol 604: 137–144.
- Corzo A, Kidd MT, Koter MD, Burgess SC: Assessment of dietary amino acid scarcity on growth and blood plasma proteome status of broiler chickens. Poult Sci 2005, 84: 419–425.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.