Ontology-driven indexing of public datasets for translational bioinformatics
© Shah et al. 2009
Published: 5 February 2009
The volume of publicly available genomic scale data is increasing. Genomic datasets in public repositories are annotated with free-text fields describing the pathological state of the studied sample. These annotations are not mapped to concepts in any ontology, making it difficult to integrate these datasets across repositories. We have previously developed methods to map text-annotations of tissue microarrays to concepts in the NCI thesaurus and SNOMED-CT.
In this work we generalize our methods to map text annotations of gene expression datasets to concepts in the UMLS. We demonstrate the utility of our methods by processing annotations of datasets in the Gene Expression Omnibus. We demonstrate that we enable ontology-based querying and integration of tissue and gene expression microarray data. We enable identification of datasets on specific diseases across both repositories. Our approach provides the basis for ontology-driven data integration for translational research on gene and protein expression data.
Based on this work we have built a prototype system for ontology based annotation and indexing of biomedical data. The system processes the text metadata of diverse resource elements such as gene expression data sets, descriptions of radiology images, clinical-trial reports, and PubMed article abstracts to annotate and index them with concepts from appropriate ontologies. The key functionality of this system is to enable users to locate biomedical data resources related to particular ontology concepts.
The amount and diversity of genomic scale data have been steadily increasing for the past several years. This increase has enabled integrative translational bioinformatics studies across these datasets. Currently, the predominant form of genomic scale data is gene expression microarrays. Recently, other forms of genomic scale measurements have been gaining acceptance, one of them being tissue microarrays. Tissue microarrays allow for the immunohistochemical analysis of large numbers of tissue samples and are used for confirmation of microarray gene-expression results as well as for predictive pathology. A single tissue microarray (TMA) paraffin block can contain as many as 500 different tumors, enabling the screening of thousands of tumor samples for protein expression using a few array sections. Databases such as the Stanford Tissue Microarray Database (TMAD) provide a central repository for data from TMAs, akin to the Stanford Microarray Database (SMD) and Gene Expression Omnibus (GEO) for gene expression arrays.
Currently, it is difficult to integrate results from gene expression microarrays and tissue microarrays, and several reviews have suggested that it is essential to address this issue and to synchronize the analysis, interpretation and data standards for these data [3, 5]. In order to develop integrative approaches to interpret gene and protein expression datasets, there is a strong and pressing need to be able to identify all experiments that study a particular disease. A key query dimension for such integrative studies is the sample, along with a gene or protein name. As a result, besides queries that identify all genes that have a function X – which can be reliably answered using the Gene Ontology (GO) – we need to conduct queries that find all samples/experiments that study a particular disease and/or the effect of an experimental agent. However, because there is no commonly used ontology or vocabulary to describe the diagnosis, disease studied or experimental agent applied in a given experimental dataset, it is not possible to perform such a query.
The challenge is to create consistent terminology labels for each experimental dataset that would allow the identification of all samples that are of the same type at a given level of granularity (e.g., all carcinoma samples versus all Adenocarcinoma in situ of prostate samples, where the former is at a coarser level of detail). One mechanism for achieving this objective is to map the text-annotations describing the diagnoses, pathological state and experimental agents applied to a particular sample to ontology terms, allowing us to formulate refined or coarse search criteria [6, 7]. Butte et al. have previously applied a text-parser (GenoText) to determine the phenotypic and experimental context from text annotations of GEO experiments. They report that text-parsing is still an inefficient method to extract value from these annotations. In later work, Butte et al. explored the use of PubMed identifiers of the publication associated with GEO experiments and their assigned Medical Subject Headings (MeSH) to identify disease related experiments. They were able to relate 35% of PubMed associated GEO series to human diseases. Only about half of the GEO experiments have PubMed identifiers; the remainder are inaccessible to this approach, possibly necessitating alternative methods.
We have previously developed methods to process such text-annotations for tissue microarrays and map them to concepts in the NCI thesaurus and the SNOMED-CT ontologies [9, 10]. In the current work we generalize our methods to process text annotations for gene expression datasets in GEO (as well as TMAD) and map them to concepts in the UMLS. We present results on the accuracy of our mapping effort and demonstrate how the mapping enables better query and integration of gene expression and protein expression data. We discuss the utility of our approach to derive integrative analyses. We believe that a similar integration problem exists for other kinds of data sources: for example, a researcher studying the allelic variations in a gene would want to know all the pathways that are affected by that gene, the drugs whose effects could be modulated by the allelic variations in the gene, and any disease that could be caused by the gene, and the clinical trials that have studied drugs or diseases related to that gene. The knowledge needed to study such questions is available in public biomedical resources; the problem is finding that information.
Currently most publicly available biomedical data are annotated with unstructured text and rarely described with ontology concepts available in the domains. The challenge is to create consistent terminology labels for each element in the public resources that would allow the identification of all elements that relate to the same type at a given level of granularity. These resource elements range from experimental data sets in repositories, to records of disease associations of gene products in mutation databases, to entries of clinical-trial descriptions, to published papers, and so on. Creating ontology-based annotations from the textual metadata of the resource elements will enable end users to formulate flexible searches for a wider range of biomedical data besides just gene expression arrays and tissue arrays [6, 7, 9, 11]. Therefore, the key challenge is to automatically and consistently annotate the biomedical data resource elements to identify the biomedical concepts to which they relate. Expanding on our preliminary work with GEO datasets and tissue microarray annotations, we have built a prototype system for ontology based annotation and indexing of biomedical data in the BioPortal ontology repository. The system's indexing workflow processes the text metadata of several biomedical resource elements to annotate (or tag) them with concepts from appropriate ontologies and create an index to access these elements. The key functionality of this system is to enable users to find biomedical data resources related to particular ontology concepts.
The Gene Expression Omnibus (GEO) is an international repository of microarray data run by the National Center for Biotechnology Information (NCBI). In this work we use the November 2006 release of GEO, which contained 108371 samples, 4593 GEO experiment series and 1080 GEO datasets, 369 of which are human. In this analysis we focus only on the human datasets. Each GEO dataset has a title and a description field that contain text entered by the person uploading the dataset. Moreover, GEO datasets can have an additional 24 descriptors (such as agent, cell line, and species) along with their subset descriptions. In the current work we process the text from the title, description and agent descriptors of GEO datasets.
The Stanford Tissue Microarray Database (TMAD) contains data from immunohistochemical analyses performed with tissue microarrays. The TMAD provides tools for quick upload, storage and retrieval of the tissue microarray images and the analysis of immunohistochemical staining results. Each sample in the TMAD contains free-text annotations – entered by the experimenter – for fields such as the organ system, and up to five diagnosis terms (one principal diagnosis field and four sub diagnosis fields) describing the sample. We concatenate the text from these six fields and use it as the annotation of the sample for our work. We refer to experiments studying samples with the same diagnoses as one dataset. Currently, the TMAD has 10734 samples that can be grouped into 1045 datasets according to their diagnoses.
We downloaded UMLS 2006 AD and created a MySQL database using the MetamorphoSys tool as described in the UMLS documentation.
In order to map existing annotations in TMAD and GEO to ontology terms, we used the UMLS-Query module developed by our group to process the existing descriptions of the samples and match them to ontology terms. Fully describing the UMLS-Query module and all of its functionality is beyond the scope of this work, but we describe the key mapping function mapToId here. For each text-annotation, we read a sliding window of five words from the text. We generate all possible permutations (5-grams) of these words and look for an exact match to an ontology term. We examine all five-word permutations because we observed that most disease and drug names are fewer than five words in length. This permutation based method, though accurate, can be made more computationally efficient; we are working on that collaboratively with the National Center for Integrative Biomedical Informatics. We restrict the matches to the SNOMEDCT and NCI thesaurus vocabularies when identifying disease names. The UMLS-Query module along with detailed documentation is available from . We do not employ any natural language processing strategies such as stemming, normalization or noun-phrase recognition. We also do not employ any heuristics or hacks for increasing match accuracy.
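The sliding-window matching described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the UMLS-Query implementation (which is a Perl module): the function name map_to_ids, the dictionary TERMS and the identifiers X1 and X2 are hypothetical, and we assume that ordered selections of one to five words from each window are tried, since most names are shorter than the window.

```python
from itertools import permutations

def map_to_ids(text, term_index, window=5):
    """Return concept ids whose term exactly matches some ordering of up to
    `window` words taken from a sliding window over `text`."""
    words = text.lower().split()
    found = set()
    # Slide the window one word at a time; short texts yield a single window.
    for start in range(max(1, len(words) - window + 1)):
        chunk = words[start:start + window]
        # Try every ordered selection of 1..len(chunk) words from the window.
        for size in range(1, len(chunk) + 1):
            for perm in permutations(chunk, size):
                concept = term_index.get(" ".join(perm))
                if concept is not None:
                    found.add(concept)
    return found

# Hypothetical two-term dictionary; the ids are placeholders, not real CUIs.
TERMS = {
    "acute myeloid leukemia": "X1",
    "leukemia": "X2",
}

matches = map_to_ids("Leukemia acute myeloid samples from patients", TERMS)
```

Under this reading, a five-word window produces at most 5 + 20 + 60 + 120 + 120 = 325 candidate strings, which keeps an exhaustive dictionary lookup tractable and lets a term match even when its words appear in a different order in the annotation.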
The result of the mapping is a table that associates each GEO or TMAD dataset identifier to one or more concepts in the UMLS. We query this table to identify disease related datasets as well as identify matching datasets from the repositories.
For our prototype we used a tool called mgrep, developed by the University of Michigan. Mgrep implements a novel radix tree based data-structure that enables fast and efficient matching of text against a set of ontology terms. We use mgrep instead of UMLS-Query in our prototype because mgrep is several times faster than UMLS-Query and the key idea implemented in the mapToId function of UMLS-Query is also implemented in mgrep.
We use relations contained in the source ontologies to expand the annotations. For example, using the is_a relation, for each term we create additional annotations according to the parent-child relationships of the original concept, a process we refer to as the transitive closure of the annotations. As a result, if a resource element such as a GEO expression study is annotated with the concept pheochromocytoma from the National Cancer Institute Thesaurus (NCIT), then a researcher querying for retroperitoneal neoplasms can find data sets related to pheochromocytoma, because the NCIT provides the knowledge that pheochromocytoma is_a retroperitoneal neoplasms. The annotations based on the transitive closure are created offline because processing the transitive closure is very time consuming, even with a pre-computed hierarchy, and doing it at query time would result in prolonged response times for users. This use case is similar, in principle, to the query expansion done by search engines like Entrez; however, Entrez does not use the NCIT. Therefore, even though pheochromocytoma related GEO data sets exist, none show up on searching for retroperitoneal neoplasms in Entrez. In our system, however, a researcher could search for retroperitoneal neoplasms and find the relevant samples.
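The expansion of direct annotations through the is_a hierarchy can be illustrated with the following Python sketch. It is not the prototype's code: the function names, the element identifier GDS_example and the intermediate node "neoplasm" are hypothetical, and the toy hierarchy only mirrors the pheochromocytoma example from the text.

```python
def ancestors(concept, parents):
    """All concepts reachable from `concept` by following is_a links upward."""
    seen = set()
    stack = list(parents.get(concept, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen

def expand_annotations(direct, parents):
    """Map each element to its direct concepts plus every is_a ancestor."""
    expanded = {}
    for element, concepts in direct.items():
        closure = set(concepts)
        for concept in concepts:
            closure |= ancestors(concept, parents)
        expanded[element] = closure
    return expanded

# Toy is_a hierarchy (child -> list of parents); "neoplasm" is a made-up node.
IS_A = {
    "pheochromocytoma": ["retroperitoneal neoplasms"],
    "retroperitoneal neoplasms": ["neoplasm"],
}
# Hypothetical element with one direct annotation.
DIRECT = {"GDS_example": {"pheochromocytoma"}}

INDEX = expand_annotations(DIRECT, IS_A)
```

Because the expanded index is computed once and stored, a query for retroperitoneal neoplasms reduces to a lookup that returns GDS_example, without walking the hierarchy at query time.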
At the user level, on searching for a specific ontology concept, the results provide resource elements annotated directly or via the step of transitive closure. The user receives the result in terms of references and links (URL/URI) to the original resource elements. The system architecture illustrates the generalizability of our implementation. Note the same model could be applied for domains other than biomedical informatics. The only specific components of the system are the resource access tools (which are customized for each resource) and, of course, the ontologies.
A GEO dataset represents a collection of biologically – and statistically – comparable samples processed using the same platform. We treat each GEO dataset as a resource element and process its metadata. Each GEO dataset has a title and a summary context that contain free text metadata entered by the person creating the dataset. For example, the GEO dataset 'GDS1989', which is available online and can be retrieved using the EUtils API, has the title Melanoma progression. GDS1989's summary contains the phrase: melanoma in situ. The Human disease ontology provides the concept Melanoma in our system's dictionary. Therefore, our concept recognition tool produces the following annotations:
Element GDS1989 annotated with concept DOID:1909 in context title;
Element GDS1989 annotated with concept DOID:1909 in context summary;
The structure of the Human disease ontology shows that DOID:1909 has 36 direct or indirect parents, such as DOID:169 (Neuroendocrine Tumors) and DOID:4 (Disease). Therefore, the transitive closure on the is_a relation generates the following annotations:
Element GDS1989 annotated with concept DOID:169 with closure;
Element GDS1989 annotated with concept DOID:4 with closure;
We processed the annotations of the 369 GEO datasets (GDS) and 1045 TMAD datasets available at the time this work was conducted. We then evaluated the ability of our ontology-based indexing scheme to enable the identification of experiments for the following use cases: 1) accurately identify experiments related to particular diseases; 2) identify gene and protein expression datasets corresponding to diseases from both GEO and TMAD. These use cases were defined based on current research in translational bioinformatics and prior reviews indicating the need for such integrative analyses [1–3, 8].
Categorization of GEO datasets according to the Semantic type after excluding matches to high level ontology terms.
Number of GDS
Disease or Syndrome
Injury or Poisoning
Mental or Behavioral Dysfunction
Overview of the number of GEO datasets for concepts in the neoplastic process and disease or syndrome category
Examples of cancers with many GEO datasets
Acute myeloid leukemia
Acute lymphoblastic leukemia
Examples of cancers with few GEO datasets
Acute promyelocytic leukemia
Examples of diseases with many GEO datasets
Disease or Syndrome
Disease or Syndrome
Chronic obstructive pulmonary disease
Disease or Syndrome
Examples of diseases with few GEO datasets
Disease or Syndrome
Disease or Syndrome
Disease or Syndrome
We have previously presented results on processing annotations in TMAD. For the current work, we reprocessed the records in TMAD and performed the evaluation as described before. The average precision and recall were 85% and 95% respectively. The annotations of 1045 datasets mapped to 902 disease related concepts in the UMLS. We do not discuss this further because it has been described in our previous work.
From the 902 disease related datasets in TMAD and 241 disease related GDS that we identified in GEO, we were able to identify 45 disease related concepts for which there were datasets in both GEO and TMAD – and hence are potential candidates to support further analysis. Many of them are high level matches (such as Leukemia) that are accurate but too high level to enable correlative analyses.
Diseases for which there are both gene expression and tissue microarray datasets
Acute myeloid leukemia
Clear cell carcinoma
Renal cell carcinoma
Cutaneous malignant melanoma
Clear cell sarcoma
Accuracy of identifying disease related datasets
Accuracy in identifying disease related datasets: Precision = 83.8%; Recall = 86.6%
Accuracy in identifying disease related datasets after limiting high level matches: Precision = 89.9%; Recall = 80.6%
Next, we evaluate the ability to accurately match up the right tissue array datasets with gene expression datasets. Out of the 45 candidate datasets proposed as corresponding between GEO and TMAD, on manual inspection all of them were accurate matches, though 12 were high level terms such as Cancer, Syndrome, and Sarcoma. We consider these false positives because such matches are uninformative for the purpose of matching up disease related datasets across repositories. This gives us a precision of 73%. We were unable to perform a recall analysis for this task because it is extremely time consuming to manually examine all the TMAD and GEO datasets to determine the number of matches that were missed by our method. One notable disease for which datasets exist in both repositories but were not accurately identified is Breast carcinoma. Most datasets in GEO for this disease are labeled with the term Breast Cancer (C0006142) and those in TMAD are labeled with the term Breast carcinoma (C0678222). These terms have different CUIs in the UMLS and hence we were unable to match up these datasets.
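The cross-repository matching and the precision figure above can be reproduced with a short Python sketch. It assumes that matching amounts to intersecting the sets of CUIs assigned in each repository; the dataset identifiers GDS_a and TMA_a and the function names are hypothetical, while the two CUIs and the counts 45 and 12 are taken from the text.

```python
def shared_concepts(geo, tmad):
    """Concept ids annotated on at least one dataset in each repository."""
    geo_concepts = set().union(*geo.values())
    tmad_concepts = set().union(*tmad.values())
    return geo_concepts & tmad_concepts

def precision(true_positives, proposed):
    """Fraction of proposed matches that are counted as correct."""
    return true_positives / proposed

# Hypothetical datasets labeled with the two breast cancer CUIs from the text.
GEO = {"GDS_a": {"C0006142"}}   # Breast Cancer
TMAD = {"TMA_a": {"C0678222"}}  # Breast carcinoma

missed = shared_concepts(GEO, TMAD)  # empty: the CUIs differ, so no match

# 45 proposed concepts, 12 of them counted as false positives (too high level).
p = precision(45 - 12, 45)
```

Here p evaluates to 33/45, or about 73%, and the empty intersection shows why exact CUI matching misses the breast cancer datasets: the two labels are close in meaning but are distinct concepts in the UMLS.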
The National Center for Biomedical Ontology (NCBO) develops and maintains a Web application called BioPortal to access biomedical ontologies. This library contains a large collection of ontologies, such as GO, NCIT, and the International Classification of Diseases (ICD), in different formats (OBO, OWL, etc.). Users can browse and search this repository of ontologies both online and via a Web services API.
Number of elements annotated from each resource in the current prototype
Number of elements
Resource local size (Mb)
Number of direct annotations (mgrep results)
Total number of 'useful'1 annotations
Average number of annotating concepts
Gene Expression Omnibus
ARRS GoldMiner (subset)
In our prototype, we have processed: (1) gene-expression data sets from GEO and ArrayExpress, (2) clinical-trial descriptions from ClinicalTrials.gov, (3) captions of images from a subset of ARRS GoldMiner, and (4) abstracts of a subset of articles from PubMed. Table 5 shows both the current number of elements annotated and the number of annotations created from each resource that we have processed. Our prototype uses 48 different biomedical ontologies that give us 793681 unique concepts and 2130700 terms. As a result of using such a large number of terms, our system provides annotations for almost 100% of the processed resources. The average number of annotating concepts is between 359 and 769 per element, with an average of 27% of these annotations resulting from directly recognizing a term. In the current prototype, concept recognition is done using a tool developed by the National Center for Integrative Biomedical Informatics (NCIBI) called mgrep. We have conducted a thorough analysis of the accuracy as well as scalability and flexibility of mgrep in recognizing concepts representing disease names, body parts and biological processes. The prototype design of the annotation level is such that we can plug in other concept recognizers. The prototype is available online at .
In the current work we have processed annotations of tissue microarray samples in TMAD as well as annotations of gene expression datasets in GEO and mapped them to concepts in the UMLS. One of the insights in our work is that disease names are rarely longer than five words and it is computationally tractable to perform an exhaustive search for all possible five-word permutations of the text-annotation.
Mapping text-phrases to UMLS concepts has been performed by other researchers in the past [21, 22]. Most of these approaches are for the purpose of automatic indexing of biomedical literature, and have been shown to be inadequate for processing annotations of high-throughput datasets [1, 8]. It has also been shown that for the task of identifying concepts from annotations of high-throughput datasets, simple methods perform as well as or better than MetaMap [1, 8, 10, 15].
In the current work, we use very simple methods to process text-annotations and map them to ontologies to demonstrate that such automatic mapping enables integration of gene expression and tissue microarray datasets.
Currently the only way to query the processed annotations is via SQL queries. Writing complicated SQL queries is not feasible for all end-users, so the ontology hierarchies and the mapped annotations should drive specialized query interfaces. One possible approach is to process the user's query text using the same indexing method that mapped the annotations of the datasets and retrieve those datasets that have the largest intersection with the concepts identified in the processed query. In the future we will develop interfaces to search these processed annotations as part of our work at the National Center for Biomedical Ontology, to create resources and methods to help biomedical investigators store, view, and compare annotations of biomedical research data.
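The retrieval strategy suggested above, ranking datasets by how many concepts they share with the processed query, might look like the following Python sketch. It assumes the query text has already been mapped to a set of concept identifiers by the same indexing method; the function name, the dataset identifiers and the concept identifiers are all hypothetical.

```python
def rank_by_overlap(query_concepts, index):
    """Return (dataset, overlap size) pairs, largest overlap first.

    Datasets sharing no concept with the query are omitted; ties are
    broken by dataset identifier so the ordering is deterministic.
    """
    scored = [(dataset, len(query_concepts & concepts))
              for dataset, concepts in index.items()]
    scored = [pair for pair in scored if pair[1] > 0]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical index: dataset id -> set of annotating concept ids.
INDEX = {
    "DS1": {"C_melanoma", "C_skin"},
    "DS2": {"C_melanoma"},
    "DS3": {"C_leukemia"},
}

ranked = rank_by_overlap({"C_melanoma", "C_skin"}, INDEX)
```

With this toy index, DS1 ranks first because it shares both query concepts, DS2 follows with one shared concept, and DS3 is not returned. A real interface would probably also weight concepts by specificity, given that high level matches were found to be uninformative.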
Genomic scale data is increasing in volume and diversity. Currently, little attention is being paid to the problem of developing methods to integrate complementary data types such as gene expression microarrays and tissue microarrays based on their annotations [3, 5].
We believe it is possible to perform ontology based indexing of free-text annotations of high throughput datasets in public databases to enable such integration. If the annotations of multiple databases – such as those of GEO, TMAD, SMD, and PharmGKB – are processed in this manner, it will enable users to perform integrated analyses of multiple, high throughput, genomic scale datasets. In order to demonstrate the feasibility as well as utility of ontology based indexing, we have developed a prototype system to annotate elements from public biomedical resources and enable users to search the indexed resources in an ontology driven manner.
We have demonstrated that we can effectively map the text-annotations of microarray datasets in GEO as well as annotations of tissue microarrays in TMAD to concepts from vocabularies in the UMLS. Our results show that we can map disease names and disease related concepts with high precision and recall.
We demonstrate how such mapping to concept hierarchies offers the ability to identify corresponding datasets across different repositories for integrative analyses. We identified 23 candidate datasets for further study. We have implemented the mapping functionality as a PERL module named UMLS-Query, which is available with documentation at .
We have described a prototype implementation of an ontology-based annotation and indexing system. The system processes text metadata of gene-expression data sets, descriptions of radiology images, clinical-trial reports, as well as abstracts of PubMed articles to annotate them automatically with concepts from appropriate ontologies. The system enables researchers to locate relevant biological data sets for integrative analyses.
We acknowledge Robert Marinelli and Matt van de Rijn for access to TMAD data. NHS, CM and MAM acknowledge support from NIH grant U54 HG004028. APC, AJB and RC acknowledge support from the Lucile Packard Foundation for Children's Health and the National Library of Medicine (K22 LM008261). This work is supported by the National Center for Biomedical Computing (NCBC) National Institutes of Health roadmap initiative; NIH grant U54 HG004028. We also acknowledge the assistance of Manhong Dai and Fan Meng at the University of Michigan as well as Chuck Kahn for access to the Goldminer resource.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 2, 2009: Selected Proceedings of the First Summit on Translational Bioinformatics 2008. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S2.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.