Mapping proteins to disease terminologies: from UniProt to MeSH
© Mottaz et al.; licensee BioMed Central Ltd. 2008
Published: 29 April 2008
Although the UniProt KnowledgeBase is not a medically oriented database, it contains information on more than 2,000 human proteins involved in pathologies. However, these annotations are not standardized, which impairs the interoperability between biological and clinical resources. In order to make these data easily accessible to clinical researchers, we have developed a procedure to link diseases described in the UniProtKB/Swiss-Prot entries to the MeSH disease terminology.
We mapped disease names extracted either from the UniProtKB/Swiss-Prot entry comment lines or from the corresponding OMIM entry to MeSH. Different methods were assessed on a benchmark set of 200 disease names manually mapped to MeSH terms. The performance of the retained procedure in terms of precision and recall was 86% and 64%, respectively. Using the same procedure, more than 3,000 disease names in Swiss-Prot were mapped to MeSH with comparable efficiency.
This study is a first attempt to link proteins in UniProtKB to medical resources. The indexing we provide will help clinicians and researchers navigate from diseases to genes and from genes to diseases in an efficient way. The mapping is available at: http://research.isb-sib.ch/unimed.
Biomedical data available to researchers and clinicians have increased drastically over the last decade because of the exponential growth of knowledge in molecular biology. While this has led to the creation of numerous databases and information resources, the interoperability between these resources remains poor. One of the main problems is that medical terminologies are scarcely used in molecular biology. For instance, the UniProt Knowledgebase (UniProtKB), the most comprehensive protein warehouse with extensive cross-references to other database resources, contains more than 2,000 human proteins with manually curated information on their involvement in pathologies, yet this information is not easily accessible to clinical researchers. This is because UniProtKB does not use standard medical vocabularies to describe the diseases associated with proteins and their variants.
In order to increase the interoperability between biomolecular and clinical resources, one of the key solutions lies in the development or unification of common terminologies capable of acting as a metadata layer that provides the missing links between the various resources. In the medical/clinical domain, there have already been numerous successful efforts to implement controlled vocabularies for pathologies. Terminologies such as MeSH (the controlled vocabulary thesaurus used for indexing biomedical and health-related documents), ICD-10 (the official disease classification provided by the World Health Organisation (WHO) for diagnostic information), and SNOMED-CT (the clinical terminology used for clinical information) have all served well in their respective domains of application. Most of these terminologies are collected and organised into concepts in the UMLS, a major repository of biomedical standard terminologies.
The recent integration of the Gene Ontology (GO) into the UMLS, as well as the development of numerous biological ontologies under the Open Biological Ontologies initiative (OBO), have opened new ways of linking biological and medical resources via terminologies. Terminology and ontology mapping has therefore become an active field of research, the objective being to identify correspondences between concepts of different resources. The National Library of Medicine (NLM) made an important pioneering effort through the integration of more than 60 medical vocabularies in the UMLS Metathesaurus and the development of lexical tools for this purpose. In parallel, many approaches have been developed which integrate lexically-based, as well as knowledge- and semantics-based methods to map, for instance, GO terms to UMLS concepts [9, 10], representations of anatomy, and genotypic and phenotypic data [12, 13]. In the biological field, similar initiatives are emerging for linking OBO ontologies. It was shown that the mapping could be improved by a combination of lexical alignments and hybrid mapping techniques which integrate structural properties of the ontologies. The most advanced tools for aligning and merging ontologies indeed take advantage of both the similarity between terms and the structural features of the resources.
In this study, we tested different automatic approaches to map the disease terms in UniProtKB to MeSH. The MeSH thesaurus is the NLM's controlled vocabulary for subject indexing in MEDLINE. It is structured in a hierarchy of descriptors, with each descriptor including a set of concepts, and each concept itself containing a set of terms, which are synonyms and lexical variants. This rich vocabulary is included in the UMLS and is therefore linked to many other biomedical terminologies. The mapping procedures described below took advantage of the manual annotation in UniProtKB as well as the curated links of UniProtKB entries to OMIM, a comprehensive knowledge base of human genes and genetic diseases. A benchmark set was created for the evaluation and refinement of term matching algorithms.
Overview of the mapping procedure
The mapping procedure consisted of three steps:
1. we extracted the disease names from the Swiss-Prot and OMIM entries;
2. for each disease name, we looked for an exact match with a MeSH term, where all words composing the name had an identical correspondent in a MeSH term and vice versa;
3. when the previous step failed, we looked for partial matches by decomposing the name into its word components and calculating a similarity score with MeSH terms.
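The exact-match step can be sketched as a comparison of word sets. The following is a minimal illustration, assuming a simple regular-expression tokenizer and a toy term-to-descriptor table; the function names and the data are hypothetical and are not taken from the actual implementation.

```python
import re

def bag_of_words(term):
    """Lowercase a term and return its set of words (hyphenated words stay whole)."""
    return frozenset(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", term.lower()))

def build_index(mesh_terms):
    """Map each bag of words to the MeSH descriptor(s) whose terms produce it."""
    index = {}
    for term, descriptor in mesh_terms.items():
        index.setdefault(bag_of_words(term), set()).add(descriptor)
    return index

def exact_match(disease_name, index):
    """Return descriptors having a term with exactly the same words as the name."""
    return index.get(bag_of_words(disease_name), set())

# Toy term -> descriptor table (illustrative only, not real MeSH data).
MESH_TERMS = {
    "Intestinal Pseudo-Obstruction": "Intestinal Pseudo-Obstruction",
    "Pseudo-Obstruction, Intestinal": "Intestinal Pseudo-Obstruction",
    "Epidermolysis Bullosa": "Epidermolysis Bullosa",
}
INDEX = build_index(MESH_TERMS)

print(exact_match("intestinal pseudo-obstruction", INDEX))
print(exact_match("X-linked congenital idiopathic intestinal pseudoobstruction", INDEX))
```

Because the comparison is on sets, word order is ignored, and a name with extra words (second call) yields no exact match and would be passed on to partial matching.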
To define the whole procedure, a benchmark set was created for the evaluation and refinement of term matching algorithms. Different methods adapted from textual information retrieval techniques were tested. Namely, we evaluated the effect of linguistic pre-processing of the terms to get rid of lexical variations of words (with/without normalisation). A method developed by Ha-Thuc and Srinivasan for gene name recognition was also tested.
The methods were assessed in terms of retrieval, recall and precision, which measure the proportion of terms mapped among all terms, the proportion of terms correctly mapped among all terms, and the proportion of terms correctly mapped among mapped terms, respectively. A detailed description of the methodology is provided in the Methods section.
The benchmark set
We constructed a benchmark set consisting of 200 randomly selected diseases manually mapped to one or several MeSH terms. The principal problem encountered in this manual mapping process was the lack of specificity of MeSH in the field of genetic diseases. As a result, only about a quarter of the disease names (52) were mapped to a term of similar meaning. The other 148 were mapped to a term of coarser granularity and, for 90 of them, we had to choose more than one parent term since the same term could belong to several branches in the MeSH hierarchy. For instance, the disease name X-linked congenital idiopathic intestinal pseudoobstruction (P21333) was associated with the MeSH term Intestinal Pseudo-Obstruction. However, this term is in no way linked to a branch indicating the genetic origin of the disease. Therefore, we also mapped the disease to two other coarser terms belonging to other hierarchies: Genetic Disease, X-Linked and Digestive System Abnormalities.
The manually mapped terms were used to evaluate the performance of automatic procedures described below.
Disease name extraction
“(CBL) can be converted to an oncogenic protein by deletions or mutations that disturb its ability to down-regulate RTKs.” (P22681)
By manually assessing the extraction results, we noticed that, as the system was constructed to extract only a single disease name per line, it was unable to handle lines such as:
“KRT16 and KRT17 are coexpressed only in pathological situations such as metaplasias and carcinomas of the uterine cervix and in psoriasis vulgaris.” (P08779)
We did not investigate these cases further, as the structure of disease lines is scheduled for revision as part of Swiss-Prot annotation standardization efforts.
In parallel, we extracted disease names and synonyms from the 2,087 OMIM phenotypes (#) and genes with phenotypes (+) entries cited in the 2,601 Swiss-Prot disease lines. This corresponded to 82% of the total OMIM entries on phenotypes with a known molecular basis (v. August 2007).
Establishing the mapping procedure using the benchmark set
The 200 disease names of the benchmark set and their associated OMIM terms were automatically mapped to the “Diseases” and “Psychiatry and Psychology” categories of the MeSH (v. August 2007). This subset of MeSH consists of 43,220 different terms. The automatic mapping procedure was done independently on disease names from Swiss-Prot and from OMIM. Different techniques were evaluated to maximize the number of exact and partial term matches.
Briefly, the exact matching step consisted of transforming all terms into bags of words, either with or without word normalisation. The word normalisation step was performed using the Norm program of the NLM. Term pre-processing had no effect on this dataset, the two procedures giving exactly the same results (Table 1, columns 1-3). All exact matches provided by Swiss-Prot disease names were correct. The coverage obtained using OMIM terms was better. This could be explained by the presence of synonyms for each disease, which increased matching opportunities. However, the presence of synonyms also increased the risk of incorrect mappings. Indeed, the only three false positive matches were caused by a difference of classification between MeSH and OMIM. For instance, two types of epidermolysis bullosa, which are distinct MeSH descriptors, are synonyms in OMIM. When we combined the exact matches provided by Swiss-Prot and OMIM, the recall increased to 26%, with a precision of 96%. It should be noted that an overlap of disease mappings from the two resources did not necessarily mean that the matching terms were the same, but rather that they belonged to the same descriptor in the MeSH terminology.
The disease names not mapped by exact matches went through a partial matching procedure. For this, three separate procedures were tested in order to evaluate the effect of term pre-processing as well as the use of different scoring functions:
Procedure 1: Term pre-processing followed by calculation of a similarity score for matching terms based on an adaptation of the weighting schema ‘Term Frequency x Inverse Document Frequency’ (TFIDF);
Procedure 2: No term pre-processing followed by calculation of the same similarity score as in procedure 1;
Procedure 3: Use of the program developed by Ha-Thuc and Srinivasan.
The weighting schema TFIDF is commonly used in information retrieval techniques. This scoring method allows one to evaluate the informative content of a word in a collection of documents. Ha-Thuc and Srinivasan's program uses a different adaptation of TFIDF which allows partial matches at the word level [19, 20]. The method also takes advantage of synonymy resources to improve the similarity scoring by increasing the weights of words common to several synonyms.
The performance of Ha-Thuc's synonym-based similarity scoring was slightly lower than that of the simpler scoring system we developed. This could be due to the fact that their program calculated a vector similarity measure using the cosine coefficient. Indeed, in a first attempt to set up a scoring schema, we noticed that the cosine coefficient was less effective on our data. It appears therefore that this similarity measure, although widely used in information retrieval from texts, is less efficient for terminology mapping.
Based on these evaluations, we decided to set up the complete mapping procedure using the scoring method we developed. The word normalisation pre-treatment was included in the procedure even though it did not result in a real gain of performance. This choice was motivated by our intention to map Swiss-Prot diseases to ICD-10, which does not include lexical resources. For that terminology, a word normalisation step could be essential.
Table 1. Evaluation of the mapping of 200 UniProtKB/Swiss-Prot disease lines (173 with a reference to OMIM). [Table body not recovered; surviving column headers: SP ∩ OMIM, SP ∪ OMIM.]
The mappings of the benchmark, both manual and automatic, are available in additional file 1.
Automatic mapping of UniProtKB/Swiss-Prot disease comment lines
Table 2. Mapping on MeSH of the 3,408 UniProtKB/Swiss-Prot disease lines (2,601 with a corresponding OMIM entry). [Table body not recovered; surviving column headers: SP ∩ OMIM, SP ∪ OMIM.]
As a first assessment, we checked whether, in the case of exact matches, corresponding Swiss-Prot and OMIM terms mapped to identical MeSH descriptors. This was the case in all but 17 instances. These discrepancies in descriptor matching were mainly due to differences in classification, with OMIM synonyms corresponding to distinct descriptors in MeSH. Another minor cause was the mention of multiple diseases in the UniProtKB/Swiss-Prot comment line. In these cases, the disease name with an OMIM reference was different from the one extracted.
In this study, we designed a mapping procedure to link the UniProtKB/Swiss-Prot human protein entries and the corresponding OMIM entries to the MeSH disease terminology. MeSH was chosen as it is interlinked with many biomedical terminologies within the UMLS. More importantly, its intimate association with literature will provide us with a valuable means for knowledge discovery using data-mining in the future.
To derive an efficient mapping procedure, alternative methods were tested in order to evaluate the effect of term pre-processing and the use of different similarity scoring systems. These methods did not differ drastically in terms of performance. The benchmark dataset used for evaluation may be too small to draw definite conclusions. However, the fact that MeSH includes many lexical and orthographic term variations does provide an explanation for the low benefit obtained from term normalisation. On the other hand, as both MeSH and OMIM have synonym resources, the mapping procedure should have been improved by Ha-Thuc's method, which cleverly takes into account the word frequency in a set of synonyms. It is possible that the parameters used in Ha-Thuc's program, which was initially developed for gene name recognition in textual documents, need to be re-adjusted to better suit the purpose of terminology mapping.
The final mapping procedure we set up by combining exact and partial matches of disease names from OMIM and Swiss-Prot was able to provide a high precision mapping for more than half of the total number of disease comment lines in UniProtKB/Swiss-Prot. Although this retrieval could be considered as low for certain applications, it should be noted that stringent conditions were chosen on purpose to provide a high quality fully automated mapping procedure. If manual curation could be solicited, we could accept a reduced precision.
Recently, the same approach was used to map diagnosis-related annotations of tumor tissue microarrays to the NCI Thesaurus with better results (a mapping coverage of 86% and an estimated precision of 86%). These differences in performance could be simply explained by the richness of the domain-specific NCI-T vocabulary compared to MeSH. Indeed, one of the main problems encountered in the mapping process lay in the difference of granularity between the terminologies, with MeSH being relatively coarse-grained for genetic diseases. Therefore, one strategy to increase the performance of the system would be to allow mapping to less specific concepts. For instance, the system should be able to map the disease name, pyruvate dehydrogenase e3-binding protein deficiency, to its correct parent, pyruvate dehydrogenase complex deficiency disease, which currently has a similarity score below the threshold value. To achieve this, one can try to improve the word weighting in order to get rid of rare words without disease-related meaning, such as e3-binding protein. This can be done by considering either a common English word thesaurus or a larger biomedical resource, such as the whole MEDLINE database, for the word frequency calculation. More sophisticated linguistic methods could also be applied to analyse the syntactic and semantic structure of the term. Finally, it may be worth integrating information from the MeSH terminology structure in the score calculation, as such a strategy has been successfully used for categorising OMIM phenotypes using MeSH terms.
Apart from the direct mapping strategy, preliminary work was done to evaluate several indirect mapping strategies that exploit the textual information provided by UniProtKB/Swiss-Prot and OMIM. The first method consisted in using a generic categorizer, XMap, to associate Swiss-Prot disease comment lines with a ranked set of MeSH descriptors. The preliminary results on the benchmark were not convincing (data not shown). This is in agreement with other studies using MetaMap, a similar program developed by the NLM; these studies reported that such complex methods did not outperform simpler heuristics such as ours in categorising structured database annotations [23, 24]. Nevertheless, the method could be more efficient on longer texts such as the OMIM disease description fields.
The second method consisted in using the textual information from the biomedical literature cited in Swiss-Prot and OMIM. Indeed, MeSH is used to index MEDLINE documents and this information can be used to find the correct term. In a preliminary attempt, all disease MeSH terms in OMIM's citations were extracted and ranked according to their frequency. The precision for the first ranked terms was found to be 57%. The result was rather promising given that the method was not based on term similarity. In future developments, we may consider using this complementary method in combination with the direct mapping.
Nevertheless, the problem of MeSH granularity will hardly be completely solved by these methods. We definitely need to explore the use of other medical terminology resources, such as ICD-10 or SNOMED-CT.
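The frequency ranking used in this indirect strategy can be illustrated with a short sketch. It assumes that the disease MeSH terms of each cited MEDLINE record are already available as lists and that a term is counted once per citation; both the input structure and the toy data are hypothetical.

```python
from collections import Counter

def rank_mesh_terms(citations):
    """Rank MeSH terms by the number of citations in which they appear."""
    counts = Counter()
    for terms in citations:
        counts.update(set(terms))  # count each term at most once per citation
    # Most frequent first; ties are broken alphabetically for reproducibility.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Toy input: disease MeSH terms of three citations of one OMIM entry (illustrative).
CITATIONS = [
    ["Intestinal Pseudo-Obstruction", "Genetic Disease, X-Linked"],
    ["Intestinal Pseudo-Obstruction"],
    ["Digestive System Abnormalities", "Intestinal Pseudo-Obstruction"],
]
RANKED = rank_mesh_terms(CITATIONS)
print(RANKED[0])
```

The first ranked term would then be proposed as the mapping, which requires no lexical similarity between the disease name and the MeSH term.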
In conclusion, this work represents the first step in standardizing the medical vocabularies in the UniProt Knowledgebase. Through this effort, we provide a bridge for the medical informatics community to explore the genomic and proteomic data present in biological databases which could be of value for disease understanding.
Extraction of disease names
In UniProtKB/Swiss-Prot, disease information related to a protein entry is expressed in free text comment lines (category ‘Involvement in disease’). We proceeded by first manually establishing a list of regular expressions that indicated the presence of disease names within a Swiss-Prot comment line, such as ‘cause(s)’, ‘cause of’, ‘involved in’, ‘contribute(s) to’. The expressions are listed in additional file 3. The extraction of complete disease names was relatively easy as they are usually located at the end of a sentence, before a conjunction or a relative clause, or directly followed by a corresponding OMIM identifier.
In parallel, the fields Title and Alternative titles; symbols were extracted from the cited OMIM entries. These two fields provide the disease names in OMIM as well as a set of synonyms. For names coming from “gene and phenotype (+)” entries, both gene names and disease names were included in the disease list.
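As an illustration of this extraction step, the sketch below uses a single regular expression built from the four trigger expressions quoted above. The complete list is in additional file 3 and is not reproduced here; the end-of-name conditions, the `[MIM:...]` identifier syntax, the gene names and the example lines are assumptions made for the example.

```python
import re

# Trigger expressions quoted in the text; "cause of" is listed before "causes?"
# so that the longer alternative is tried first.
TRIGGERS = r"(?:cause of|causes?|involved in|contributes? to)"

# The name ends before an OMIM identifier, a relative clause, a conjunction,
# or the end of the sentence (assumed stop conditions).
PATTERN = re.compile(
    rf"\b{TRIGGERS}\s+(?P<name>.+?)"
    r"(?=\s*(?:\[MIM:\d+\]|\bwhich\b|\band\b|[.;]|$))",
    re.IGNORECASE,
)

def extract_disease_name(line):
    """Return the first disease name found in a comment line, or None."""
    match = PATTERN.search(line)
    return match.group("name").strip(" ,") if match else None

# Hypothetical comment lines; GENE1/GENE2/GENE3 and the MIM number are placeholders.
LINE_1 = ("Defects in GENE1 are the cause of X-linked congenital idiopathic "
          "intestinal pseudoobstruction [MIM:000000].")
LINE_2 = "GENE2 contributes to tumour formation and metastasis."
LINE_3 = "GENE3 is expressed in liver."

for line in (LINE_1, LINE_2, LINE_3):
    print(extract_disease_name(line))
```

Like the system described here, this sketch extracts a single name per line, so a line mentioning several diseases (as in `LINE_2`) is only partially captured.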
The mapping procedure was tested with and without word normalisation. The word normalisation was done using the program Norm from the lexical tools provided by the NLM. Norm removes stop words and plural forms, uninflects verbs, lowercases words, etc. For the mapping without word normalisation, we simply lowercased the term components and removed punctuation signs and unspecific words such as “susceptibility to” or “development of” from the disease names extracted from Swiss-Prot (see additional file 3). The word “included”, which qualifies a synonym of closely related meaning, was also removed from OMIM Alternative titles. The terms were transformed into “bags of words”, without taking collocations into account, except for hyphenated words.
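The pre-processing used without Norm can be sketched as follows. This is a minimal illustration, assuming that only the two unspecific expressions quoted above are removed (the full list is in additional file 3); the function name and the example strings are hypothetical.

```python
import re

# Only the two expressions quoted in the text; the full list is not reproduced here.
UNSPECIFIC_PHRASES = ("susceptibility to", "development of")

def simple_bag_of_words(name, from_omim=False):
    """Lowercase, strip unspecific phrases and punctuation, return the word set."""
    text = name.lower()
    for phrase in UNSPECIFIC_PHRASES:
        text = text.replace(phrase, " ")
    text = re.sub(r"[^\w\s-]", " ", text)  # drop punctuation, keep hyphens
    words = {word.strip("-") for word in text.split()}
    words.discard("")  # left over when a token was only hyphens
    if from_omim:
        words.discard("included")  # OMIM qualifier for closely related synonyms
    return words

print(simple_bag_of_words("Susceptibility to psoriasis vulgaris."))
print(simple_bag_of_words("EPIDERMOLYSIS BULLOSA, JUNCTIONAL, INCLUDED", from_omim=True))
print(simple_bag_of_words("X-linked intestinal pseudoobstruction"))
```

Hyphenated words such as `x-linked` are kept as single tokens, in line with the treatment of hyphenated words described above.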
Where freq=n/N, with n the number of occurrences of the word in all OMIM (Titles, Alternative titles), MeSH terms (disease category) and Swiss-Prot disease comment lines, and N the total number of words in these documents. cw and ncw stand for words in common and not in common, respectively, between the two mapped terms, and size(disease) is a normalization factor consisting of the number of words composing the disease name to be mapped.
We also calculated term similarity using the program kindly provided by Ha-Thuc and Srinivasan. The implemented procedure uses a ‘soft’ TFIDF approach which introduces a character-based similarity between words [19, 20]. In addition, it takes into account the word frequencies in a set of synonym names by increasing the TF scores of words that are common to several synonyms of a disease name.
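The quantities defined above can be illustrated in code. The scoring formula itself is not reproduced in this text, so the sketch below is only an assumption: it computes freq=n/N as defined, weights each word by -log(freq) in the spirit of an inverse document frequency, and combines the weights of common (cw) and non-common (ncw) words as a difference divided by size(disease). The combination rule, the function names and the toy corpus are hypothetical and should not be read as the published score.

```python
import math
from collections import Counter

def word_frequencies(corpus_terms):
    """freq = n / N: occurrences of a word over the total number of words."""
    counts = Counter(word for term in corpus_terms for word in term.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def similarity(disease, candidate, freq):
    """Assumed score: (weights of common words - weights of other words) / size."""
    rarest = min(freq.values())  # fallback frequency for unseen words

    def weight(word):
        return -math.log(freq.get(word, rarest))  # rare words weigh more

    disease_words = set(disease.lower().split())
    candidate_words = set(candidate.lower().split())
    cw = disease_words & candidate_words   # words in common
    ncw = disease_words ^ candidate_words  # words not in common
    gain = sum(weight(word) for word in cw)
    penalty = sum(weight(word) for word in ncw)
    return (gain - penalty) / len(disease_words)  # size(disease) normalisation

# Toy corpus standing in for OMIM titles, MeSH terms and Swiss-Prot disease names.
CORPUS = [
    "pyruvate dehydrogenase complex deficiency disease",
    "intestinal pseudo-obstruction",
    "pyruvate kinase deficiency",
]
FREQ = word_frequencies(CORPUS)

print(similarity("pyruvate kinase deficiency", "pyruvate kinase deficiency", FREQ))
print(similarity("pyruvate kinase deficiency", "intestinal pseudo-obstruction", FREQ))
```

Under this assumed form, a candidate sharing rare words with the disease name scores higher than one sharing only frequent words, and unmatched rare words are penalised, which matches the behaviour discussed for the e3-binding protein example.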
In order to evaluate the mapping procedure, 200 disease comment lines from 95 UniProtKB/Swiss-Prot entries were manually mapped to MeSH by a medical expert. Swiss-Prot entries were selected randomly. However, care was taken so that the chosen sample of entries would be representative and lead to a proportion of exact and partial matches similar to that found in a preliminary mapping attempt.
The mapping procedure was assessed in terms of precision, p=TP/(TP+FP), and recall, r=TP/total number of terms, where TP is the number of correct mappings (true positives) and FP is the number of incorrect mappings (false positives). Since the system was forced to retain only the best match, we considered, in the case of diseases manually mapped to several MeSH terms, that the automatic mapping was correct if at least one of these terms was mapped.
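These measures can be computed as in the following sketch, which also applies the rule that a mapping counts as correct when the single best match is among the manually assigned terms. The dictionary layout and the toy data are assumptions for illustration.

```python
def evaluate(auto_mapping, gold_mapping):
    """Compute retrieval, precision and recall for best-match mappings.

    auto_mapping: disease -> best MeSH term found, or None when nothing was mapped
    gold_mapping: disease -> set of manually assigned MeSH terms
    """
    total = len(gold_mapping)
    mapped = 0
    true_positives = 0
    for disease, gold_terms in gold_mapping.items():
        best = auto_mapping.get(disease)
        if best is None:
            continue  # unmapped terms lower retrieval and recall only
        mapped += 1
        if best in gold_terms:  # correct if at least one manual term is hit
            true_positives += 1
    false_positives = mapped - true_positives
    return {
        "retrieval": mapped / total,
        "precision": true_positives / (true_positives + false_positives) if mapped else 0.0,
        "recall": true_positives / total,
    }

# Toy benchmark of four diseases (illustrative labels only).
GOLD = {"d1": {"A", "B"}, "d2": {"C"}, "d3": {"D"}, "d4": {"E"}}
AUTO = {"d1": "B", "d2": "C", "d3": "X", "d4": None}
SCORES = evaluate(AUTO, GOLD)
print(SCORES)
```

Note that recall is computed over all terms of the benchmark, so it can never exceed retrieval.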
The β value was set to 0.5 so as to favor the precision of the mapping.
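The formula to which β belongs is not reproduced in this text. Assuming it is the standard weighted F-measure, F = (1 + β²)·p·r / (β²·p + r), a value of β below 1 gives more weight to precision than to recall. A minimal sketch under that assumption:

```python
def f_beta(precision, recall, beta=0.5):
    """Standard weighted F-measure; beta < 1 favours precision over recall."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)

# With beta = 0.5, trading recall for precision is rewarded:
HIGH_PRECISION = f_beta(0.9, 0.5)
HIGH_RECALL = f_beta(0.5, 0.9)
print(HIGH_PRECISION > HIGH_RECALL)

# Illustration with the precision and recall reported for the retained procedure.
print(round(f_beta(0.86, 0.64), 2))
```

This choice is consistent with the stated aim of a fully automated procedure in which incorrect mappings are costlier than missing ones.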
This work was funded by the Swiss National Science Foundation (grant No 3100A0-113970). We are grateful to Viet Ha-Thuc who kindly provided us with his program. The authors also wish to thank Julien Gobeill for performing the preliminary indirect mappings using Xmap and Violaine Pillet for her comments on the manuscript.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 5, 2008: Proceedings of the 10th Bio-Ontologies Special Interest Group Workshop 2007. Ten years past and looking to the future. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S5.
1. The UniProt Consortium: The Universal Protein Resource (UniProt). Nucleic Acids Res 2007, 35: D193-D197.
2. Nelson SJ, Schopen M, Savage AG, Schulman JL, Arluk N: The MeSH Translation Maintenance System: Structure, Interface Design, and Implementation. Medinfo 2004, 11(Pt 1): 67–69.
3. International Statistical Classification of Diseases and Health Related Problems (ICD-10). Second edition. WHO Press, Geneva.
4. Donnelly K: SNOMED-CT: The advanced terminology and coding system for eHealth. Stud Health Technol Inform 2006, 121: 79–90.
5. Bodenreider O: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004, 32: D267-D270.
6. Gene Ontology Consortium: The Gene Ontology (GO) project in 2006. Nucleic Acids Res 2006, 34: D322-D326.
7. Ashburner M, Mungall CJ, Lewis SE: Ontologies for biologists: a community model for the annotation of genomic data. Cold Spring Harbor Symp Quant Biol 2003, 227–236.
8. National Library of Medicine: UMLS Lexical Tools. [http://www.nlm.nih.gov/research/umls/tools.html]
9. Sarkar IN, Cantor MN, Gelman R, Hartel F, Lussier YA: Linking biomedical language information and knowledge resources: GO and UMLS. Pac Symp Biocomput 2003, 439–450.
10. Cantor MN, Sarkar IN, Gelman R, Hartel F, Bodenreider O, Lussier YA: An evaluation of hybrid methods for matching biomedical terminologies: Mapping the Gene Ontology to the UMLS. Stud Health Technol Inform 2003, 95: 62–67.
11. Zhang S, Mork P, Bodenreider O, Bernstein PA: Comparing two approaches for aligning representations of anatomy. Artif Intell Med 2007, 39: 227–236.
12. Lussier YA, Li J: Terminological mapping for high throughput comparative biology of phenotypes. Pac Symp Biocomput 2004, 202–213.
13. Cantor MN, Sarkar IN, Bodenreider O, Lussier YA: GenesTrace: Phenomic knowledge discovery via structured terminology. Pac Symp Biocomput 2005, 103–114.
14. Johnson HL, Cohen KB, Baumgartner WA, Lu Z, Bada M, Kester T, Kim H, Hunter L: Evaluation of lexical methods for detecting relationships between concepts from multiple ontologies. Pac Symp Biocomput 2006, 28–39.
15. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 2005, 33: D514–517.
16. The Specialist Lexical Tools. [http://lexsrv3.nlm.nih.gov/SPECIALIST/index.html]
17. Shatkay H: Hairpins in bookstacks: Information retrieval from biomedical text. Brief Bioinform 2005, 6: 222–38.
18. Ha-Thuc V, Srinivasan P: Exploiting synonym relationships in biomedical named entity matching. In BioLINK SIG 2007, ISMB/ECCB. Vienna; July 2007.
19. Bilenko M, Mooney R, Cohen W, Ravikumar P, Fienberg S: Adaptive name matching in information integration. IEEE Intellig Sys 2003, 18: 16–23.
20. Cohen W, Ravikumar P, Fienberg S: A comparison of string distance metrics for name-matching tasks. Proc JCCAI Conf 2003, 73–78.
21. Ruch P: Automatic assignment of biomedical categories: toward a generic approach. Bioinformatics 2006, 22: 658–664.
22. Aronson AR: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. AMIA Annu Symp Proc 2001, 17–21.
23. Butte AJ, Kohane IS: Creation and implications of a phenome-genome network. Nat Biotechnol 2006, 24: 55–62.
24. Butte AJ, Chen R: Finding disease-related genomic experiments within an international repository: first steps in translational bioinformatics. AMIA Annu Symp Proc 2006, 106–110.
25. Shah NH, Rubin DL, Espinosa I, Montgomery K, Musen MA: Annotation and query of tissue microarray data using the NCI Thesaurus. BMC Bioinformatics 2007, 8: 296.
26. van Driel MA, Bruggeman J, Vriend G, Brunner HG, Leunissen JA: A text-mining analysis of the human phenome. Eur J Hum Genet 2006, 14: 535–542.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.