Data elements
Origin
Our test set consists of DEs extracted from eleven Web-accessible biomedical sources, selected to be representative of the different kinds of resources found in the biomedical domain. Some contain information about genes: GeneCards [13], Entrez Gene, Geneloc [14], Genew (the HGNC [15] database), and HGMD [16]; others about proteins: Swiss-Prot [17], PDB, HPRD, and Interpro [18]; or about diseases: OMIM [19]. Because our application is not targeted at a particular model organism, we also included MGI [20], which provides various kinds of information about mice.
Extracting data elements
Creating a set of terms for querying sources
To query the data sources mentioned above, we first established a list of query terms, namely gene and disease names. To this end, we exploited a reference resource in the domain of medical genetics: the Genetics Home Reference [21] (GHR). GHR provides information about genetic conditions and the genes involved in these conditions. Using the Web interface to GHR, a bioinformatician (FM) manually compiled a text file containing gene symbols (e.g. HFE) and associated disease names (e.g. hemochromatosis), if any. A sample of one hundred terms randomly extracted from this file constitutes the set of terms we used for querying DE sources.
Acquiring DEs
The sources used in this study are Web interfaces to biological databases, whose pages are generated automatically by programs. Most pages of a given source are therefore expected to share a common organization and presentation. We take advantage of this feature to identify terms that recur throughout the Web pages and which, we hypothesize, correspond to DEs. In practice, we developed a program that systematically queries the eleven sources through their query URLs. For each source, a set of 100 HTML pages is created, one for each entry of the set of biomedical terms. After the header and footer are eliminated, the elements common to at least 75% of the HTML pages are extracted automatically. This selection eliminates specific information (e.g., a given gene name) while keeping general information (e.g., the term "Gene Name") [22]. An example of DEs extracted from the source Genew is given in Figure 1. For instance, the terms "Approved Symbol" and "Approved Name" appear on all three pages and are therefore identified as candidate DEs.
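The 75% selection rule can be illustrated with a minimal sketch. It assumes that each page has already been reduced to a list of text fragments with header and footer removed; the function name, the fragment representation, and the toy pages are hypothetical and are not taken from the program used in this study.

```python
from collections import Counter


def candidate_data_elements(pages, threshold=0.75):
    """Return the text fragments found on at least `threshold` of the pages.

    `pages` is a list of lists: each inner list holds the text fragments of
    one page (assumption: header and footer were stripped beforehand).
    """
    if not pages:
        return set()
    counts = Counter()
    for fragments in pages:
        # Count a fragment once per page, however often it repeats there.
        counts.update(set(fragments))
    min_pages = threshold * len(pages)
    return {fragment for fragment, n in counts.items() if n >= min_pages}


# Toy pages (invented for illustration): labels recur, values do not.
pages = [
    ["Approved Symbol", "HFE", "Approved Name", "value A"],
    ["Approved Symbol", "TNXB", "Approved Name", "value B"],
    ["Approved Symbol", "BRCA1", "Approved Name", "value C"],
    ["Approved Symbol", "HFE", "Approved Name", "value A"],
]
candidates = candidate_data_elements(pages)
```

With these toy pages, the labels occur on every page and are retained, whereas "HFE" occurs on only half of the pages and is discarded.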
Terminological resources
A biomedical controlled terminology: the UMLS
We chose the Unified Medical Language System® (UMLS®) [23], a biomedical terminology integration system, because it provides a wide coverage of the biomedical domain, including terminologies for specialized clinical disciplines, the biomedical literature, and genome annotations. The UMLS consists of three major components. The UMLS Metathesaurus is assembled by integrating more than 100 source vocabularies. It contains about 1.2 million concepts (clusters of synonymous terms) and more than 22 million relationships between these concepts. The UMLS Semantic Network is a limited network of 135 semantic types. Each Metathesaurus concept is assigned to at least one semantic type. Finally, the Lexical Resources comprise the SPECIALIST Lexicon and Lexical Tools [24]. The UMLSKS Developer's API provides various methods for identifying Metathesaurus concepts from input terms (exact and normalized match). Additionally, the MetaMap Transfer (MMTx) program maps text to concepts in the Metathesaurus with additional flexibility (approximate match) [25]. The 2005AA version of the UMLS is used in this study.
A biomedical collection of data elements: the NCI caDSR
The National Cancer Institute (NCI) has created a Cancer Data Standards Registry (caDSR) [26] as part of the caCORE, a common infrastructure for cancer informatics [27]. Its main goal is to define a comprehensive set of standardized metadata descriptors for cancer research terminology used in information collection and analysis. Various NCI offices and partner organizations have developed the content of the caDSR by registering DEs based on data standards, data collection forms, databases, clinical applications, data exchange formats, UML models, and vocabularies. Using the ISO/IEC 11179 [28] model for metadata registration, information about names, definitions, permissible values, and semantic concepts for common data elements (CDEs) has been recorded. In this study, we used version 3.0.1.2 of the NCI caDSR, which comprises some 13,000 CDEs.
Method
Our method can be summarized as follows. Starting from the DEs automatically extracted from eleven Web resources, we first attempt to find a direct correspondence between our DEs and biomedical terms in the UMLS on the one hand and existing CDEs in the NCI caDSR on the other. Alternatively, we map the values corresponding to our DEs to the UMLS and expect to determine the type of each DE from the semantic types of the terms corresponding to its values. More formally, we first apply lexical methods at the schema level in order to map DEs extracted from distinct sources to common vocabularies. We then apply lexical methods at the instance level and use external resources to enhance, filter and refine the DE mappings.
Direct mapping of data elements to terminological resources
Mapping to the UMLS Metathesaurus
Our approach to mapping DEs to UMLS concepts is as conservative as possible. We first attempt to find an exact match. If none is found, a match is attempted after normalization. In practice, this process makes the input and target terms potentially compatible by eliminating such inessential differences as inflection, case, underscore and hyphen variations, as well as word-order variation [24]. These two steps are implemented by the corresponding methods of the UMLSKS API. Finally, an approximate match is attempted using MMTx (strict model). The mapping procedure is fully automated and stops as soon as a match is found. The output of the mapping consists of the list of Metathesaurus concepts corresponding to each DE, along with their semantic types.
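The cascade described above (exact match, then normalized match, then approximate match, stopping at the first success) can be sketched as follows. This is an illustration only: it does not call the UMLSKS API or MMTx. The dictionary `concepts`, the function names, and the pluggable `approximate` callback are hypothetical stand-ins, and the simplified `normalize` function handles case, hyphen, underscore and word-order variation but not inflection.

```python
import re


def normalize(term):
    """Simplified normalization: lowercase, split on spaces, hyphens and
    underscores, then sort the words. Inflection is NOT handled here."""
    words = re.split(r"[\s_\-]+", term.lower().strip())
    return " ".join(sorted(w for w in words if w))


def map_term(term, concepts, approximate=None):
    """Return (match_kind, concept_names), stopping at the first success.

    `concepts` maps a concept name to its list of semantic types (a toy
    stand-in for the Metathesaurus). `approximate` is an optional callable
    standing in for an approximate matcher.
    """
    # Step 1: exact match.
    if term in concepts:
        return "exact", [term]
    # Step 2: match after normalization of both input and target terms.
    key = normalize(term)
    hits = [name for name in concepts if normalize(name) == key]
    if hits:
        return "normalized", hits
    # Step 3: approximate match, delegated to the caller-supplied function.
    if approximate is not None:
        hits = approximate(term, concepts)
        if hits:
            return "approximate", hits
    return "none", []


concepts = {"Protein Binding": ["Molecular Function"]}
kind, names = map_term("binding_protein", concepts)
semantic_types = [concepts[name] for name in names]
```

Because the function returns as soon as one step succeeds, a term that matches exactly is never passed to the more permissive steps, which reflects the conservative ordering of the procedure.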
Mapping to the NCI caDSR
The procedure used to map DEs to the caDSR is similar to the one used for the UMLS. The major difference is that we used a local copy of the caDSR instead of the tools provided by the NCI, which gives us additional control over the mapping process. A caDSR record consists of twelve fields. Half of them contain numbers and other data types unlikely to map to DEs, e.g. CDE identifiers such as "2178687". Four other fields are incomplete or contain information in natural language (such as the CDE definition "The name of the gene") and are thus difficult to exploit. In practice, out of the twelve fields in a caDSR record, only two are of interest for our purpose: "Long Name" and "Preferred Name". The corresponding values of these two fields for the CDE "Gene Name" are "GeneName" and "Name", respectively. We rendered input terms and caDSR CDEs compatible by removing spaces in multi-word terms in order to match the naming conventions in the caDSR. We first attempt an exact match of each DE against the Preferred Names of the caDSR. In case of failure, we attempt an exact match against the Long Names of the caDSR CDEs. Additionally, we split each multi-word DE not yet mapped to the caDSR into isolated words and attempt an exact match of these words against the Preferred Names of the CDEs, followed by an approximate match. Finally, we attempt an exact match of the isolated words against the Long Names of the caDSR CDEs. This process is also fully automated and results in a list of DEs associated with the Long Name or Preferred Name of the mapped CDE(s).
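A minimal sketch of this cascade is given below, assuming that the local copy of the caDSR has been loaded as a list of records with `long_name` and `preferred_name` keys. This layout, the function name, and the returned labels are hypothetical. The approximate-match step applied to isolated words is deliberately left out of the sketch.

```python
def map_to_cadsr(de, cdes):
    """Return (match_kind, matching_records) for a data element name.

    `cdes` is a list of dicts with 'long_name' and 'preferred_name' keys
    (an assumed layout for a local copy of the registry).
    """
    # Remove spaces so that "Gene Name" can match a name such as "GeneName".
    compact = de.replace(" ", "")
    # Whole-term exact match: Preferred Names first, then Long Names.
    for field in ("preferred_name", "long_name"):
        hits = [cde for cde in cdes if cde[field] == compact]
        if hits:
            return "whole-term/" + field, hits
    # Multi-word DEs only: exact match of isolated words, same field order.
    # (The approximate match on Preferred Names is omitted in this sketch.)
    words = de.split()
    if len(words) > 1:
        for field in ("preferred_name", "long_name"):
            hits = [cde for cde in cdes if cde[field] in words]
            if hits:
                return "single-word/" + field, hits
    return "none", []


gene_name_cde = {"long_name": "GeneName", "preferred_name": "Name"}
cdes = [gene_name_cde]
whole = map_to_cadsr("Gene Name", cdes)
partial = map_to_cadsr("Approved Name", cdes)
```

In this toy registry, "Gene Name" matches the Long Name "GeneName" as a whole term, whereas "Approved Name" is matched only through its isolated word "Name". Single-word matches of this kind are the ones most likely to need filtering afterwards.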
Indirect mapping of data elements through their values
The approaches presented in the previous section are effective for associating DEs with lexically similar entries in the terminological resources, but they are limited to cases where lexically similar terms exist on both sides. The alternative approach proposed here consists in mapping not the DEs themselves, but the values associated with them, to terminological resources. This indirect mapping is attempted for all DEs because its objective is twofold: on the one hand, to identify mappings for those DEs for which no match in the UMLS or caDSR can be found; on the other, to filter out potentially inappropriate mappings obtained through the UMLS or the caDSR. For instance, the DE Approved Name in Genew will be mapped to the DE Protein Name in Swiss-Prot because they share the word "Name". This is incorrect because Approved Name actually refers to gene names, not protein names. In practice, it is expected that the DEs will be found among the high-level categories characterizing their corresponding values. For example, values associated with the DE Approved Name include "tenascin XB" and "breast cancer 1, early onset" (see Fig. 1), categorized as Gene or Genome.
Acquiring DE values
We first created a program that automatically queries each source and retrieves the values associated with each DE identified in this source. We automatically extracted up to 100 values for each DE by querying the sources for each biomedical term of the set built as described in the paragraph Acquiring DEs of subsection Data elements. For example, the values associated with Function include "protein binding" and "enzyme regulator activity". In some cases, no value could be extracted for a given DE in a given source.
Mapping DE values to the UMLS
We used the automated methods described in the paragraph Mapping to the UMLS Metathesaurus above for mapping DE values to UMLS concepts, with the difference that only exact and normalized matches were used here. For example, protein binding was mapped to the concept "Protein Binding" (C0033618), categorized by the semantic type Molecular Function.
Extracting DE candidates
We used the semantic type(s) of the UMLS concepts resulting from the mapping of the values of a given DE to determine the type of this DE. More precisely, we selected the semantic type categorizing the majority of the concepts for a given set of values. For instance, in the example introduced previously, we are able to determine that the DE Approved Name relates to gene names since the majority of its values were categorized by the semantic type Gene or Genome (see Fig. 2.a).
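The selection of a semantic type from the mapped values can be sketched as a simple vote. The sketch interprets "majority" as the most frequent semantic type, which is an assumption: the text does not state whether a strict majority (more than half) is required, nor how ties are resolved. The function name and the input layout are hypothetical.

```python
from collections import Counter


def majority_semantic_type(value_types):
    """Return the most frequent semantic type, or None if nothing mapped.

    `value_types` holds one list per DE value, containing the semantic
    types of the concept(s) that value was mapped to. Unmapped values
    contribute an empty list.
    """
    counts = Counter(t for types in value_types for t in types)
    if not counts:
        return None
    # Assumption: "majority" is read as "most frequent"; ties are resolved
    # arbitrarily by Counter (first encountered in recent Python versions).
    return counts.most_common(1)[0][0]


# Toy input for a DE such as Approved Name: three mapped values, one unmapped.
approved_name_types = [
    ["Gene or Genome"],
    ["Gene or Genome"],
    ["Disease or Syndrome"],
    [],
]
de_type = majority_semantic_type(approved_name_types)
```

When no value of a DE maps to any concept, the function returns `None`, which corresponds to the case where the type of the DE cannot be determined by this step.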
Default indirect mapping through data element values and heuristics
When the previous process could not determine the type of a DE, we attempted to assign coarser predefined types. We first isolated DEs whose names contain specific terms. For instance, when the terms "ID(s)" or "identifier" were found, the corresponding DE was typed as Identifier. Then, we analyzed the values character by character and assigned the type Sequence to the DE when each of its non-empty values was a series of "A", "G", "C", and "T". Finally, the remaining DEs were typed as Integer or String according to their values. An example of the exploitation of DE values through heuristics is shown in Figure 2.b.
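These fallback heuristics can be sketched as an ordered series of tests. The regular expressions, the function name, and the example DE names below are assumptions made for illustration. In particular, the exact rule used to separate Integer from String is not specified in the text, so the sketch treats a DE as Integer only when all of its non-empty values are optionally signed digit strings.

```python
import re

# Assumed pattern for "ID", "IDs", "identifier" or "identifiers" as a word.
ID_PATTERN = re.compile(r"\b(ids?|identifiers?)\b", re.IGNORECASE)
INTEGER_PATTERN = re.compile(r"[+-]?\d+")


def default_type(de_name, values):
    """Assign a coarse type to a DE from its name and its values."""
    # Rule 1: the DE name itself signals an identifier.
    if ID_PATTERN.search(de_name):
        return "Identifier"
    non_empty = [v.strip() for v in values if v.strip()]
    # Rule 2: every non-empty value is made only of the letters A, C, G, T.
    if non_empty and all(set(v) <= set("ACGT") for v in non_empty):
        return "Sequence"
    # Rule 3: every non-empty value is an integer (assumed criterion).
    if non_empty and all(INTEGER_PATTERN.fullmatch(v) for v in non_empty):
        return "Integer"
    # Rule 4: everything else, including DEs without values, is a string.
    return "String"


# Hypothetical DE names and values, for illustration only.
examples = {
    "Entrez Gene ID": ["3077"],
    "cDNA": ["ACGTTGCA", "", "GGCT"],
    "Exon count": ["6", "24"],
    "Comment": ["see text"],
}
types = {name: default_type(name, vals) for name, vals in examples.items()}
```

The order of the rules matters: a DE named "Entrez Gene ID" is typed as Identifier even though its values are numeric, because the name-based rule is applied before the value-based ones.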
This indirect mapping associates a type with the DEs, which is often useful for disambiguating underspecified DEs and for filtering out potentially inappropriate mappings obtained by direct mapping to terminological resources. Additional mappings can also be identified by exploiting the type associated with DE values, when the DE itself cannot be found in existing terminological resources.