The Protein Identifier Cross-Referencing (PICR) service: reconciling protein identifiers across multiple source databases

Côté, Richard G; Jones, Philip; Martens, Lennart; Kerrien, Samuel; Reisinger, Florian; Lin, Quan; Leinonen, Rasko; Apweiler, Rolf; Hermjakob, Henning

doi:10.1186/1471-2105-8-401

Re: The Protein Identifier Cross-Referencing (PICR) service: reconciling protein identifiers across multiple source databases

Eric Jain, Swiss Institute of Bioinformatics

30 October 2007

"Redundant databases may even assign multiple identifiers to the same sequence."

Keep in mind that some databases, such as UniProtKB/Swiss-Prot, are "redundant" on purpose, i.e. sequences are considered specific to organisms. At the same time, a single identifier may be used to describe several splice variants, etc. If the goal is to create a true "general purpose" mapping service, you'd have to allow people to map at the conceptual level as well as at the sequence level. PICR looks like it could be really useful for people who need to do database mapping using exact sequence matches, but it should not be assumed that that's what most people want to do!
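To make the distinction concrete, here is a minimal sketch of the two kinds of mapping. Everything in it is invented for illustration: the database names, identifiers, gene name and sequences are toy placeholders, not real entries, and the two functions are hypothetical helpers rather than anything PICR or UniProt actually exposes.

```python
from collections import defaultdict

# Hypothetical toy records: (database, identifier, organism, gene, sequence).
# None of these are real database entries.
RECORDS = [
    ("DB_A", "A001", "human", "geneX", "MKTAYIAKQR"),
    ("DB_A", "A002", "mouse", "geneX", "MKTAYIAKQR"),   # same sequence, other organism
    ("DB_B", "B100", "human", "geneX", "MKTAYIAKQR"),
    ("DB_B", "B101", "human", "geneX", "MKTAYIAKQRLL"),  # variant of the same gene
]


def map_by_sequence(records):
    """Group identifiers that share an exact (case-insensitive) sequence."""
    groups = defaultdict(set)
    for db, ident, _organism, _gene, seq in records:
        groups[seq.upper()].add((db, ident))
    return dict(groups)


def map_by_concept(records):
    """Group identifiers that describe the same gene in the same organism."""
    groups = defaultdict(set)
    for db, ident, organism, gene, _seq in records:
        groups[(organism, gene)].add((db, ident))
    return dict(groups)


by_sequence = map_by_sequence(RECORDS)
by_concept = map_by_concept(RECORDS)
```

In this toy data the sequence-level grouping puts the human and mouse entries together and leaves the variant B101 on its own, while the conceptual grouping keeps B101 with the other human entries and separates the mouse entry. Neither answer is wrong; they answer different questions.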

"Unified identifier schemes have been proposed in the past, such as Life Science Identifiers (LSID) and Sequence Globally Unique Identifiers (SEGUID), but their adoption remains limited."

Identifier schemes address issues such as how to avoid collisions and how to resolve (or not) identifiers, but they do not address the mapping issue! The LSID scheme, for example, has no mechanism to prevent several organizations from assigning different identifiers to the same sequence.
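A small sketch of the LSID point, assuming the usual `urn:lsid:authority:namespace:object` layout. The authorities, namespaces, object ids and the sequence are made up, and `make_lsid` and `same_sequence` are hypothetical helpers, not part of any LSID library.

```python
def make_lsid(authority, namespace, object_id):
    """Build an LSID string in the urn:lsid:authority:namespace:object form."""
    return f"urn:lsid:{authority}:{namespace}:{object_id}"


# Toy sequence, not a real protein.
SEQUENCE = "MKTAYIAKQR"

# Two hypothetical organizations name the same sequence independently.
lsid_a = make_lsid("example.org", "proteins", "12345")
lsid_b = make_lsid("example.net", "sequences", "abc-9")

# Each identifier is well-formed and collision-free within its own authority,
# but nothing in the identifiers themselves says they denote the same thing.
registry = {lsid_a: SEQUENCE, lsid_b: SEQUENCE}


def same_sequence(first, second, lookup):
    """The mapping has to come from comparing the resolved data, not the names."""
    return lookup[first] == lookup[second]
```

The two identifiers differ as strings, and only a separate comparison of what they resolve to shows that they refer to the same sequence. That comparison is the mapping problem, which the scheme leaves open.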

"The ID Mapping service offered by Protein Information Resource (PIR) has limited functionality in that it can only map between two sources per request, meaning that if the user wishes to map proteins from SGD, IPI and Genbank to UniProt, three requests must be made"

PIR's mapping service does support mapping from multiple sources (though the mapping is always *to* a single source, and I'm not sure the web form supports this).
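As a rough illustration of "several sources in, one target out" in a single call, here is a sketch over an in-memory table. This is not PIR's interface: the cross-reference table, the identifiers, the accessions and `map_to_target` are all placeholders invented for the example.

```python
# Hypothetical cross-reference table: (source database, source id) -> target accession.
# All identifiers and accessions below are invented placeholders.
XREFS = {
    ("SGD", "S0001"): "ACC00001",
    ("IPI", "IPI0001"): "ACC00002",
    ("GenBank", "G0001"): "ACC00001",
}


def map_to_target(identifiers, xrefs):
    """Map a mixed-source batch of identifiers to the single target database.

    Returns the successful mappings and the identifiers that had no entry.
    """
    mapped = {}
    unmapped = []
    for key in identifiers:
        if key in xrefs:
            mapped[key] = xrefs[key]
        else:
            unmapped.append(key)
    return mapped, unmapped


# One request carrying identifiers from three different sources.
batch = [("SGD", "S0001"), ("IPI", "IPI0001"), ("GenBank", "G0001"), ("SGD", "S9999")]
mapped, unmapped = map_to_target(batch, XREFS)
```

Because each identifier carries its source alongside it, the batch does not need to be split into one request per source database; the restriction is only that every result lands in the same target.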

"Also, not all mappings are available. For example, it is possible to map from SGD to UniProt [..] but not from SGD to Genbank."

This is supported, but since the mapping is provided by UniProtKB (in collaboration with SGD), it may not be complete. Note, though, that a pure sequence-based mapping is likely to miss mappings as well, unless of course a pure sequence-based mapping is really what you want.

It may also be worth pointing out that while PICR lists 21 databases, PIR's mapping service supports more than 100! (See the interface at http://beta.uniprot.org/mapping/.)

One shortcoming of PIR's mapping service is performance, especially when mapping large sets of several thousand identifiers. Here it would be interesting to see some benchmark numbers!

"We are in communication with the NCBI to obtain daily up-to-date gi number to UniProtKB accession number mapping files, which will be incorporated into the UniParc data warehouse and made available via PICR."

GI numbers have been in UniParc for a while now?

Competing interests

None declared

Archived Comments for: The Protein Identifier Cross-Referencing (PICR) service: reconciling protein identifiers across multiple source databases

BMC Bioinformatics
