The Protein Identifier Cross-Referencing (PICR) service: reconciling protein identifiers across multiple source databases
© Côté et al; licensee BioMed Central Ltd. 2007
Received: 30 May 2007
Accepted: 18 October 2007
Published: 18 October 2007
Each major protein database uses its own conventions when assigning protein identifiers. Resolving the various, potentially unstable, identifiers that refer to identical proteins is a major challenge. This is a common problem when attempting to unify datasets that have been annotated with proteins from multiple data sources or querying data providers with one flavour of protein identifiers when the source database uses another. Partial solutions for protein identifier mapping exist but they are limited to specific species or techniques and to a very small number of databases. As a result, we have not found a solution that is generic enough and broad enough in mapping scope to suit our needs.
We have created the Protein Identifier Cross-Reference (PICR) service, a web application that provides interactive and programmatic (SOAP and REST) access to a mapping algorithm. The algorithm uses the UniProt Archive (UniParc) as a data warehouse to offer protein cross-references, based on 100% sequence identity, to proteins from over 70 distinct source databases loaded into UniParc. Mappings can be limited by source database, taxonomic ID and activity status in the source database. Using the interactive interface, users can copy/paste or upload files containing protein identifiers or sequences in FASTA format to obtain mappings. Search results can be viewed in simple or detailed HTML tables or downloaded as comma-separated values (CSV) or Microsoft Excel (XLS) files suitable for use in a local database or a spreadsheet. Alternatively, a SOAP interface is available to integrate PICR functionality in other applications, as is a lightweight REST interface.
We offer a publicly available service that can interactively map protein identifiers and protein sequences to the majority of commonly used protein databases. Programmatic access is available through a standards-compliant SOAP interface or a lightweight REST interface. The PICR interface, documentation and code examples are available at http://www.ebi.ac.uk/Tools/picr.
Biological data is being generated at an unparalleled rate and data analysis is becoming a key challenge in bioinformatics and systems biology. Two common tasks are more difficult than they should be: identifier unification, where datasets from various sources must be merged together for analysis, and identifier translation, where identifiers from one source (e.g. NCBI gi number) need to be converted to those from another source (e.g. Ensembl) so that they can be used in database-specific tools and queries. A major hindrance to the effective implementation of those tasks is that data comes from multiple sources, each using a proprietary identifier scheme that is not always easily traceable to a specific provider.
It is common to observe the same protein sequence being referred to by multiple identifiers. Redundant databases may even assign multiple identifiers to the same sequence. This problem is compounded by the fact that identifiers are unstable and can (and do!) disappear from source databases. For example, it is common for hypothetical proteins to be replaced when gene prediction algorithms are updated. Identifiers from in-house or proprietary databases are unknown to the outside world. At best, protein identifier translation into a common search space is a tedious task. At worst, it is an impossible one.
The major reference databases, such as the Universal Protein Knowledge Base (UniProtKB), Ensembl and the NCBI RefSeq, maintain a comprehensive list of cross-references to each other, but full coverage is difficult to achieve because these databases have different production cycles and release schedules. Smaller, more specialized databases or proprietary ones might not be included in the cross-referencing process described above and will not be linked from these databases. Ultimately, this means that users must still query multiple sources to ensure that they have a complete picture with the latest information available.
The mapping problem has been tackled before by many groups using varied approaches. Unified identifier schemes have been proposed in the past, such as Life Science Identifiers (LSID) and Sequence Globally Unique Identifiers (SEGUID), but their adoption remains limited.
Many tools were investigated but found wanting, either because of the limited scope of databases or species they cover, their lack of an API for batch or programmatic access, or because they are tailored to one particular field. Others have limited usability, such as allowing few variables per request or requiring knowledge of the exact source and destination databases.
For example, SeqDB imports sequence information from external sources and generates a list of known aliases. However, coverage of synonyms is limited to a small number of source databases, and the tool can only be used interactively in a web browser. IDConverter and IDClight are web-based tools that map between clones, gene identifiers and protein accession numbers, but the mappings are restricted to three species (human, rat and mouse) and only cover a small number of sources. IDClight does offer the possibility to use web links to perform one mapping per request, but datasets are only refreshed every two months. The National Cancer Institute caBIG GeneConnect project will offer both programmatic and interactive queries, but is currently limited to mappings between Ensembl, RefSeq and UniProt.
The ID Mapping service offered by the Protein Information Resource (PIR) has limited functionality in that it can only map between two sources per request, meaning that if the user wishes to map proteins from SGD, IPI and Genbank to UniProt, three requests must be made (SGD to UniProt, IPI to UniProt and Genbank to UniProt). Also, not all mappings are available. For example, it is possible to map from SGD to UniProt and from Genbank to UniProt, but not from SGD to Genbank.
MatchMiner is aimed more towards gene name and gene product mappings and is limited to only two species (human and mouse). Onto-Translate, SOURCE and Resourcerer are designed to be used primarily for microarray and gene expression data analysis and, as such, are not suitable for general use as they are gene-centric rather than protein-centric.
PROMPT is a standalone comparative proteomics tool that can perform protein mapping based on sequence similarity as one of its functions. However, it is up to the user to download the source files and load them into the application. Mapping coverage is therefore limited to those sources the user installs, and data freshness depends on how often the user refreshes the source files. Furthermore, although it does provide an API to integrate some functionality in other applications, it requires that a local installation be maintained.
Our goal in starting this project was to build a service that would meet the following requirements:
the ability to map sequences as well as protein identifiers;
identifiers could come from multiple sources in one request;
identifiers could be mapped to multiple destination databases in one request;
mappings could be done interactively as well as programmatically;
mappings could be limited to specific taxon identifiers or across all species;
mappings could handle identifiers deleted from source databases but still available in result sets and the scientific literature;
mappings could be done against all primary protein data sources;
mappings could be done against most other protein data sources.
The first users of this service will be the Proteomics Identifications Database (PRIDE) [16, 17] and the IntAct Database, which will use it to simplify the task of mapping large-scale proteomics and interaction experiments to a common reference system. However, by implementing the abovementioned requirements, we would provide the most powerful, comprehensive and versatile public service for mapping protein identifiers across different data sources to the scientific community at large.
To improve performance, database connection pooling (DBCP) is done using the Apache Commons DBCP API at the data layer and caching is done where possible using the OpenSymphony Cache API. Logging is done using Log4J, and real-time error reporting and user notification are done using the JavaMail API.
CrossReference objects contain the description of the source database they originate from, the accession number and version of the entry, and a status flag indicating whether the entry is active (i.e. still available in the source database release files) or inactive (i.e. deleted from the source database). They also record the date the entry was first loaded into UniParc, the date the entry was last loaded (if still active) or deleted (if inactive) and, where available, the NEWT taxonomy id and the corresponding NCBI gi number.
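The fields listed for a CrossReference can be pictured with a small data class. This is only an illustrative Python sketch, not the Java class used by PICR: every field name, the types and the consistency check between the status flag and the dates are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CrossReference:
    """Illustrative model of one cross-reference; all field names are assumed."""
    database_description: str            # description of the source database
    accession: str                       # accession number in the source database
    version: Optional[str]               # entry version, if the source provides one
    active: bool                         # True if still in the source release files
    first_loaded: date                   # date first loaded into UniParc
    last_loaded: Optional[date] = None   # meaningful while the entry is active
    deleted: Optional[date] = None       # meaningful once the entry is inactive
    taxon_id: Optional[int] = None       # NEWT taxonomy id, if available
    gi_number: Optional[str] = None      # NCBI gi number, if available

    def __post_init__(self):
        # Assumed rule: an active entry has no deletion date,
        # and an inactive entry must carry one.
        if self.active and self.deleted is not None:
            raise ValueError("an active cross-reference cannot have a deletion date")
        if not self.active and self.deleted is None:
            raise ValueError("an inactive cross-reference needs a deletion date")
```

Making the class immutable is a design choice for the sketch: a cross-reference describes a historical fact about a source database, so a changed status is more naturally a new record than an in-place update.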
Results and discussion
Data available in UniParc
Number of Releases
Number of Entries
EMBL Nucleotide Sequence Database
Whole Genome Shotgun
Annotated CON entries
Third Party Annotation
Ensembl Dasypus novemcinctus
Ensembl Otolemur garnettii
Ensembl Felis catus
Ensembl Caenorhabditis briggsae
Ensembl Caenorhabditis elegans
Ensembl Gallus gallus
Ensembl Pan troglodytes
Ensembl Ciona intestinalis
Ensembl Sorex araneus
Ensembl Bos taurus
Ensembl Canis familiaris
Ensembl Loxodonta africana
Ensembl Erinaceus europaeus
Ensembl Drosophila melanogaster
Ensembl Fugu rubripes
Ensembl Cavia porcellus
Ensembl Echinops telfairi
Ensembl Apis mellifera
Ensembl Homo sapiens
Ensembl Oryzias latipes
Ensembl Myotis lucifugus
Ensembl Anopheles gambiae
Ensembl Mus musculus
Ensembl Monodelphis domestica
Ensembl Ornithorhynchus anatinus
Ensembl Oryctolagus cuniculus
Ensembl Rattus norvegicus
Ensembl Macaca mulatta
Ensembl Spermophilus tridecemlineatus
Ensembl Gasterosteus aculeatus
Ensembl Tetraodon nigroviridis
Ensembl Tupaia belangeri
Ensembl Xenopus tropicalis
Ensembl Aedes aegypti
Ensembl Danio rerio
European Patent Office
International Protein Index
Japan Patent Office
Protein Data Bank
Protein Research Foundation
RefSeq release + updates
REFSEQ Homo sapiens
REFSEQ Mus musculus
REFSEQ Rattus norvegicus
REFSEQ Danio rerio
SWISS-PROT alternative splicing
TAIR Arabidopsis thaliana
TrEMBL alternative splicing
TROME Caenorhabditis elegans
TROME Drosophila melanogaster
TROME Homo sapiens
TROME Mus musculus
UniProt Metagenomic and Environmental Sequences
US Patent and Trademark Office
Vega Canis familiaris
Vega Homo sapiens
Vega Mus musculus
Vega Danio rerio
Mapping by sequence
Once a sequence is submitted for mapping, a CRC64 checksum is computed for that sequence and is used to quickly and efficiently query the Protein table of UniParc. Mappings are done on the basis of 100% sequence identity over the whole sequence. Subsequence matches are not considered as valid mappings as they will not generate identical CRC64 values. If no entries are found, the sequence cannot be mapped. If multiple entries are found, due to checksum collisions, the sequences are retrieved from UniParc and only the matching one is kept. CRC64 collisions are very rare but will occur, given the sequence volume of UniParc. At time of writing, 0.000115% of the total number of sequences have CRC64 collisions.
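The checksum lookup and the collision check described here can be sketched in a few lines of Python. The sketch is hedged in several ways: the CRC-64 variant (reflected polynomial 0xD800000000000000, initial value 0) is an assumption and may not match UniParc bit for bit, the ProteinTable class is a toy in-memory stand-in for the real database table, and the identifiers in any usage are hypothetical.

```python
from collections import defaultdict

# Reflected form of x^64 + x^4 + x^3 + x + 1 (assumed CRC-64 variant).
_POLY = 0xD800000000000000


def _build_table():
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc64(sequence: str) -> int:
    """Table-driven CRC-64 over the ASCII bytes of a sequence, initial value 0."""
    crc = 0
    for byte in sequence.encode("ascii"):
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc


class ProteinTable:
    """Toy stand-in for the UniParc Protein table, indexed by checksum."""

    def __init__(self, checksum=crc64):
        self._checksum = checksum
        self._by_checksum = defaultdict(list)   # checksum -> [(upi, sequence), ...]

    def add(self, upi: str, sequence: str) -> None:
        self._by_checksum[self._checksum(sequence)].append((upi, sequence))

    def find_upi(self, sequence: str):
        """Return the UPI whose stored sequence equals the query, else None."""
        candidates = self._by_checksum.get(self._checksum(sequence), [])
        for upi, stored in candidates:
            # More than one candidate means a checksum collision, so the
            # full sequences are compared and only the identical one is kept.
            if stored == sequence:
                return upi
        return None
```

Because the final comparison is on the full sequence, a subsequence of a stored protein never maps, whether or not its checksum happens to collide. Passing the checksum function in as a parameter makes it easy to force a collision when testing the fallback path.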
A UPEntry object is created and the UPI, sequence and timestamps fields are populated. The UPI of the correctly identified sequence is used to retrieve the Xref entries associated with that sequence, based on the search criteria. These criteria include the selected databases to map to, the possibility to retrieve all mappings (including inactive or deleted cross-references) or only active ones and the possibility to limit mappings to a selected species. The entries obtained from the Xref table will then be used to create CrossReference objects and will be added to the IdenticalCrossReference collection of the UPEntry object as they are all based on 100% sequence identity.
If the submitted sequence happens to have an active UniProt (SwissProt or TREMBL) cross-reference, additional data is looked up in a separate table in the UniParc schema. This supplementary information table contains additional information extracted from the current UniProt release files, including secondary identifiers, UniProt IDs (e.g. JAD1A_HUMAN for the protein whose accession number is P29375) and cross-references maintained by UniProt to data sources available in UniParc. These human-annotated (SwissProt) and automatically derived (TREMBL) cross-references can provide added value as the mappings they provide, while valid, might be to sequences that are different from the main UniProt sequence (such as splice variants, sequencing errors, natural variations, etc). Such mappings would not normally have been available via UniParc unless the exact variant sequence was queried. However, since they may not represent the exact sequence, it was decided to keep them separated from those obtained based on sequence identity. As such, CrossReference objects created from those records are stored in the LogicalCrossReference collection of the UPEntry. LogicalCrossReference data will also be filtered according to the search criteria (selected databases, activity status, taxonomy annotation).
Querying with taxonomy restrictions was designed to be pessimistic. While taxonomy annotation coverage is improving in UniParc, many databases do not provide taxonomy information. Xref entries that are not annotated with taxonomy information, or that are not an exact match to the query parameter, will not be included in the search results.
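The separation between the two kinds of mapping can be illustrated with a small container. This is a Python sketch under stated assumptions: the attribute and method names are invented for the example, cross-references are reduced to (database, accession) pairs, and the real UPEntry also carries timestamps that are omitted here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

XrefKey = Tuple[str, str]   # (database, accession), simplified for the sketch


@dataclass
class UPEntry:
    """Illustrative UniParc entry that keeps both kinds of mapping apart."""
    upi: str
    sequence: str
    # Mappings backed by 100% sequence identity.
    identical_cross_references: List[XrefKey] = field(default_factory=list)
    # Mappings taken from UniProt annotation; the sequence may differ.
    logical_cross_references: List[XrefKey] = field(default_factory=list)

    def all_cross_references(self) -> List[Tuple[str, str, str]]:
        """Flatten both collections, tagging each mapping with its origin."""
        tagged = [(db, acc, "identical") for db, acc in self.identical_cross_references]
        tagged += [(db, acc, "logical") for db, acc in self.logical_cross_references]
        return tagged
```

Keeping two collections rather than one list with a flag means a consumer that only trusts sequence identity can ignore the logical mappings without inspecting each record.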
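A minimal Python sketch of the search criteria, including the pessimistic taxonomy rule, might look like the following. The Xref tuple and the filter_xrefs function are hypothetical names introduced for illustration; only the filtering behaviour follows the text.

```python
from typing import Iterable, List, NamedTuple, Optional, Set


class Xref(NamedTuple):
    database: str
    accession: str
    active: bool
    taxon_id: Optional[int]   # None when the source gives no taxonomy


def filter_xrefs(xrefs: Iterable[Xref],
                 databases: Set[str],
                 only_active: bool = True,
                 taxon_id: Optional[int] = None) -> List[Xref]:
    """Keep the cross-references that satisfy every search criterion."""
    kept = []
    for xref in xrefs:
        if xref.database not in databases:
            continue
        if only_active and not xref.active:
            continue
        # Pessimistic rule: once a taxon is requested, an entry without
        # taxonomy annotation or with a different taxon is excluded.
        if taxon_id is not None and xref.taxon_id != taxon_id:
            continue
        kept.append(xref)
    return kept
```

The consequence of the pessimistic rule is that a taxonomy-restricted search can return fewer mappings than really exist, but never a mapping whose species is unknown or different.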
Mapping by accession
Mapping by protein identifier uses logic similar to that described above, but with a different starting point. If a protein accession is submitted, the supplementary information and Xref tables are queried to obtain all pertinent UPIs.
A UPEntry is created for each UPI and the relevant fields are populated from data gathered in the Protein table. The CrossReference collections of each UPI are then populated using the mechanisms described above. If an NCBI gi number is submitted (e.g. gi|1710032), the Xref table is queried as a starting point. However, gi number coverage is still low with respect to the overall number of entries in UniParc, at only 41.5% at time of writing. If a gi number is not in UniParc, PICR will query the NCBI eUtilities to obtain the corresponding sequence and use that as a starting point for mapping by sequence, as described above.
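The gi number fallback can be sketched as a function that takes its collaborators as parameters. Everything named here is hypothetical: xref_index stands in for the Xref table, fetch_sequence for the call to the external NCBI service, and upi_for_sequence for the mapping-by-sequence step. No real network call is made.

```python
from typing import Callable, Dict, List, Optional


def upis_for_gi(gi_number: str,
                xref_index: Dict[str, List[str]],
                fetch_sequence: Callable[[str], Optional[str]],
                upi_for_sequence: Callable[[str], Optional[str]]) -> List[str]:
    """Resolve a gi number to UPIs, falling back to mapping by sequence."""
    known = xref_index.get(gi_number)
    if known:
        return list(known)
    # Not in the Xref table: ask the external service for the sequence.
    sequence = fetch_sequence(gi_number)
    if sequence is None:
        return []
    # Continue as an ordinary mapping by sequence.
    upi = upi_for_sequence(sequence)
    return [upi] if upi is not None else []
```

Injecting the external lookup as a callable keeps the slow remote step out of the common path and lets the fallback be exercised without network access.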
Using PICR to map PRIDE identifications
90% of PRIDE identifications can be mapped to one or more UPEntry. Of the remaining 10% of identifications that are unmapped, less than 1% come from unresolved or badly formatted identifiers (including a large proportion of deprecated UniProt IDs, which are notoriously difficult to track once they are removed from circulation). The majority of the unmapped identifications originate from proprietary databases, for which the protein sequences have not been provided, or other databases not available in UniParc (mostly model organism gene and transcript identifiers). As such, most of the unmapped identifiers would have been difficult, if not impossible, to map with other available tools.
Using the web interface
Users can refine their search by changing values in the Input Parameters section. By default, PICR will only return active protein mappings across all species but it is possible to limit queries by taxonomy or expand them to include non-active mappings. To retrieve both active and non-active mappings, uncheck the 'Return only active mappings' box. To limit the mappings to a particular species, select the desired option from the 'Limit by species' menu. This menu contains the most common species present in UniParc, though over 140,000 distinct taxonomy ids are currently annotated in UniParc. If users wish to limit their searches to a species which is not predefined in the menu, they can type the organism name in the field provided.
If species are entered both in the selection menu and in the search box, the search box will take precedence. It must be noted that although we have tried to get the maximum taxonomical coverage for the mappings, some source databases do not provide taxonomy information and, as such, those mappings cannot be properly assigned to a taxon and will therefore be excluded from any search that is limited by taxonomy.
The next step involves selecting the databases the user wishes to map the input data to by updating the selections in the Mapping Databases section of the search form. To keep the interface light and simple, some mapping options actually refer to more than one database. For example, selecting Ensembl will query all the organism-specific Ensembl releases, as is the case for RefSeq, Vega and Trome. Selecting Swissprot and TREMBL will also include the respective splice variant databases.
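One way to picture the grouped options is a lookup that expands each selection into its member databases. In this Python sketch the group keys and the short member lists are illustrative assumptions, using a few of the source names that UniParc lists; the actual option names and full membership used by PICR are not reproduced here.

```python
from typing import Dict, Iterable, List

# Illustrative subset only; group keys and membership are assumed.
DATABASE_GROUPS: Dict[str, List[str]] = {
    "Ensembl": ["Ensembl Homo sapiens", "Ensembl Mus musculus", "Ensembl Danio rerio"],
    "Vega": ["Vega Homo sapiens", "Vega Mus musculus"],
    "SWISS-PROT": ["SWISS-PROT", "SWISS-PROT alternative splicing"],
}


def expand_selection(selected: Iterable[str]) -> List[str]:
    """Replace each grouped option by its member databases, preserving order."""
    expanded: List[str] = []
    for option in selected:
        # An option that is not a group stands for itself.
        for database in DATABASE_GROUPS.get(option, [option]):
            if database not in expanded:
                expanded.append(database)
    return expanded
```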
Generating the mappings is a computationally intensive process which may require calls to external services and can therefore take some time. To give the user interactive feedback on the status of the search in progress, a progress bar will be displayed on the screen as the search is processed and is updated, every second, using AJAX. When the search is complete, the results will be displayed on the screen or a file download dialog box will appear, depending on the selected options.
Users can submit any number of protein accessions or sequences to be mapped at a time. However, if more than 500 are submitted in one request, the user will be prompted to enter a valid email address and must select one of the file output formats (CSV or XLS). Once the search is done, an email is sent to the user providing a URL to download the generated result file.
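The rule for large submissions amounts to a small validation step, sketched here in Python. The function name, the error messages and the very loose email check are assumptions for illustration; only the threshold of 500 and the CSV or XLS requirement come from the text.

```python
from typing import Optional

MAX_INTERACTIVE = 500            # above this, results are delivered by email
FILE_FORMATS = {"CSV", "XLS"}    # the two downloadable formats


def validate_request(n_queries: int, output_format: str,
                     email: Optional[str] = None) -> None:
    """Raise ValueError if a large request lacks a file format or an email."""
    if n_queries <= MAX_INTERACTIVE:
        return
    if output_format.upper() not in FILE_FORMATS:
        raise ValueError("requests over 500 entries must use CSV or XLS output")
    # Deliberately loose check; real address validation is out of scope.
    if not email or "@" not in email:
        raise ValueError("requests over 500 entries need an email address")
```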
Using the SOAP and REST interfaces
PICR provides a publicly available SOAP web service to perform mappings. The service is encoded in the document/literal style for maximal interoperability. It is implemented in Java and deployed using JAX-WS to adhere to the latest WS-I specifications. Detailed developer documentation describing the SOAP service, as well as the WSDL descriptor file and sample Java client code, are available online from the PICR website.
Representational State Transfer (REST) allows data elements to be associated with a well-formed URL. The same methods that are available in the SOAP interface are also available using the REST interface, with minor modifications to the parameters. Developer documentation on how to build valid REST queries is available online from the PICR website.
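As a rough illustration of what building a REST query involves, the Python sketch below assembles a URL from mapping parameters. The path segment, the method name and all parameter names (accession, database, onlyactive, taxid) are assumptions that should be checked against the REST developer documentation; the sketch only builds a string and does not contact the service.

```python
from urllib.parse import urlencode

# Assumed values: verify against the REST developer documentation.
BASE_URL = "http://www.ebi.ac.uk/Tools/picr/rest"
METHOD = "getUPIForAccession"


def build_query_url(accession, databases, only_active=True, taxon_id=None):
    """Build a query URL with one 'database' parameter per target database."""
    params = [("accession", accession)]
    params += [("database", db) for db in databases]   # repeated parameter
    params.append(("onlyactive", "true" if only_active else "false"))
    if taxon_id is not None:
        params.append(("taxid", str(taxon_id)))
    # urlencode percent-escapes the values and joins the pairs with '&'.
    return f"{BASE_URL}/{METHOD}?{urlencode(params)}"
```

Passing the parameters as a list of pairs, rather than a dictionary, is what allows the same key to appear several times, once for each database to map to.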
Resolving protein identifiers from multiple data sources is a difficult problem and there was no existing solution generic enough to suit our needs. As such, we have created a powerful and flexible system that allows for the batch querying of protein identifiers and sequences against multiple data sources using the most comprehensive protein sequence data archive available.
Mappings can be limited by source database or taxonomic classification and the results can include data no longer available in source databases. This last feature is particularly useful when dealing with old data sets and literature citations.
We offer three distinct query interfaces: one interactive and two programmatic. The interactive web interface uses AJAX to enhance the browsing experience wherever possible and provides the possibility to obtain results in four different formats: simple HTML, detailed HTML, XLS and CSV. Users and application developers can query SOAP and REST interfaces programmatically to integrate PICR functionality in their applications or perform batch requests.
Our application will provide a valuable service to wide areas of the scientific community and plans are already underway to build on its success. Future work will include improving the gi number coverage with UniProt sequences. We are in communication with the NCBI to obtain daily up-to-date gi number to UniProtKB accession number mapping files, which will be incorporated into the UniParc data warehouse and made available via PICR. Furthermore, we plan to implement a similarity search to UniProt sequences. The mapping algorithm as presently available will be expanded such that users will be able to submit protein identifiers or sequences and obtain mappings to SwissProt and TREMBL based on a user-defined similarity threshold.
The application is freely available to use. Clients and code examples are available online under the Apache Open Source 2.0 License.
Availability and requirements
Project name: Protein Identifier Cross-Reference Service
Project home page: http://www.ebi.ac.uk/Tools/picr
WSDL service descriptor: http://www.ebi.ac.uk/Tools/picr/service?wsdl
SOAP client demo: http://www.ebi.ac.uk/Tools/picr/client/picr_demo.zip
Operating system(s): Platform independent
Programming language: Java
Other requirements: Java 1.5 or later, Apache Ant 1.6 or later
License: Apache License 2.0
Any restrictions to use by non-academics: none
API: Application Programming Interface
CRC: Cyclic Redundancy Check
NCBI: National Center for Biotechnology Information
HTML: Hyper Text Mark-up Language
PICR: Protein Identifier Cross-Referencing service
REST: REpresentational State Transfer
SOAP: Simple Object Access Protocol
UniParc: Universal Protein database Archive
UPI: UniParc Protein Identifier
XML: Extensible Mark-up Language
PICR contributors are supported through the BBSRC ISPIDER grant and EU FP6 "Felics" (contract number 021902 (RII3)) grants. RC would like to thank KC for invaluable contributions.
- The UniProt Consortium: The Universal Protein Resource (UniProt). Nucleic Acids Res 2007, (35 Database):D193–7. Epub 2006 Nov 16, PMID: 17142230. doi:10.1093/nar/gkl929
- Hubbard TJ, et al.: Ensembl 2007. Nucleic Acids Res 2007, (35 Database):D610–7. Epub 2006 Dec 5, PMID: 17148474. doi:10.1093/nar/gkl996
- Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2007, (35 Database):D61–5. Epub 2006 Nov 27, PMID: 17130148. doi:10.1093/nar/gkl842
- Clark T, Martin S, Liefeld T: Globally distributed object identification for biological knowledgebases. Brief Bioinform 2004, 5(1):59–70. PMID: 15153306. doi:10.1093/bib/5.1.59
- Babnigg G, Giometti CS: A database of unique protein sequence identifiers for proteome studies. Proteomics 2006, 6(16):4514–22. PMID: 16858731. doi:10.1002/pmic.200600032
- Boehm AM, Sickmann A: A comprehensive dictionary of protein accession codes for complete protein accession identifier alias resolving. Proteomics 2006, 6(15):4223–6. PMID: 16888720. doi:10.1002/pmic.200600018
- Alibes A, Yankilevich P, Canada A, Diaz-Uriarte R: IDconverter and IDClight: conversion and annotation of gene and protein IDs. BMC Bioinformatics 2007 Jan 10; PMID: 17214880
- caBIG GeneConnect[https://cabig.nci.nih.gov/tools/GeneConnect/]
- PIR ID Mapping[http://pir.georgetown.edu/pirwww/search/idmapping.shtml]
- Bussey KJ, Kane D, Sunshine M, Narasimhan S, Nishizuka S, Reinhold WC, Zeeberg B, Ajay W, Weinstein JN: MatchMiner: a tool for batch navigation among gene and gene product identifiers. Genome Biol 2003, 4(4):R27. Epub 2003 Mar 25, PMID: 12702208. doi:10.1186/gb-2003-4-4-r27
- Schmidt T, Frishman D: PROMPT: a protein mapping and comparison tool. BMC Bioinformatics 7: 331. 2006 Jul 4, PMID: 16817977. doi:10.1186/1471-2105-7-331
- Martens L, Hermjakob H, Jones P, Adamski M, Taylor C, States D, Gevaert K, Vandekerckhove J, Apweiler R: PRIDE: the proteomics identifications database. Proteomics 2005, 5(13):3537–45. Erratum in: Proteomics. 2005 Oct;5(15):4046. doi:10.1002/pmic.200401303
- Jones P, Cote RG, Martens L, Quinn AF, Taylor CF, Derache W, Hermjakob H, Apweiler R: PRIDE: a public repository of protein and peptide identifications for the proteomics community. Nucleic Acids Res (34 Database):D659–63. 2006 Jan 1, PMID: 16381953
- Kerrien S, Alam-Faruque Y, Aranda B, Bancarz I, Bridge A, Derow C, Dimmer E, Feuermann M, Friedrichsen A, Huntley R, Kohler C, Khadake J, Leroy C, Liban A, Lieftink C, Montecchi-Palazzi L, Orchard S, Risse J, Robbe K, Roechert B, Thorneycroft D, Zhang Y, Apweiler R, Hermjakob H: IntAct–open source resource for molecular interaction data. Nucleic Acids Res 2007, (35 Database):D561–5. Epub 2006 Dec 1, PMID: 17145710. doi:10.1093/nar/gkl958
- Leinonen R, Diez FG, Binns D, Fleischmann W, Lopez R, Apweiler R: UniProt archive. Bioinformatics 20(17):3236–7. 2004 Nov 22; Epub 2004 Mar 25, PMID: 15044231. doi:10.1093/bioinformatics/bth191
- The Java API[http://java.sun.com/]
- JAXB Reference Implementation[https://jaxb.dev.java.net/]
- The Apache Struts Web Application Framework[http://struts.apache.org/1.2.9/]
- JAX-WS Reference Implementation[https://jax-ws.dev.java.net/]
- Apache Commons DBCP[http://jakarta.apache.org/commons/dbcp/]
- OpenSymphony Cache[http://www.opensymphony.com/]
- Log4J Logging Services[http://logging.apache.org/log4j/docs/]
- The JavaMail API[http://java.sun.com/products/javamail/]
- Phan IQ, Pilbout SF, Fleischmann W, Bairoch A: NEWT, a new taxonomy portal. Nucleic Acids Res 31(13):3822–3. 2003 Jul 1, PMID: 12824428. doi:10.1093/nar/gkg516
- NCBI eUtilities[http://www.ncbi.nlm.nih.gov/entrez/query/static/esoap_help.html]
- Cote RG, Jones P, Apweiler R, Hermjakob H: The Ontology Lookup Service, a lightweight cross-platform tool for controlled vocabulary queries. BMC Bioinformatics 7: 97. 2006 Feb 28, PMID: 16507094. doi:10.1186/1471-2105-7-97
- Ashurst JL, Chen CK, Gilbert JG, Jekosch K, Keenan S, Meidl P, Searle SM, Stalker J, Storey R, Trevanion S, Wilming L, Hubbard T: The Vertebrate Genome Annotation (Vega) database. Nucleic Acids Res (33 Database):D459–65. 2005 Jan 1, PMID: 15608237
- Sperisen P, Iseli C, Pagni M, Stevenson BJ, Bucher P, Jongeneel CV: trome, trEST and trGEN: databases of predicted protein sequences. Nucleic Acids Res (32 Database):D509–11. 2004 Jan 1, PMID: 14681469
- Kersey P, Hermjakob H, Apweiler R: VARSPLIC: alternatively-spliced protein sequences derived from SWISS-PROT and TrEMBL. Bioinformatics 2000, 16(11):1048–9. PMID: 11159319. doi:10.1093/bioinformatics/16.11.1048
- PICR SOAP developer documentation[http://www.ebi.ac.uk/Tools/picr/WSDLDocumentation.do]
- PICR REST developer documentation[http://www.ebi.ac.uk/Tools/picr/RESTDocumentation.do]
- PICR main search page[http://www.ebi.ac.uk/Tools/picr]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.