Conceptual analysis is the process of mapping from natural language texts to a formal representation of the objects and predicates (together, the concepts) meant by the text. The history of attempts to build programs to do conceptual analysis dates back to at least 1967. Recent advances in the availability of high-quality ontologies, in the ability to accurately recognize named entities in texts, and in language processing methods generally have made possible a significant advance in concept analysis, arguably the most difficult and general natural language processing task. Here we report on the design, implementation and several evaluations of OpenDMAP, an ontology-driven, integrated concept analysis system that significantly advances the state of the art. We also discuss its application to three important information extraction tasks in molecular biology.
Information extraction (IE) efforts are widely acknowledged to be important in harnessing the rapid advance of biomedical knowledge, particularly in areas where important factual information is published in a diverse literature. In a recent PLoS Biology essay, Rebholz-Schuhmann argued, "It is only a matter of time and effort before we are able to extract facts [from articles in the primary literature] automatically. The consequences are likely to be profound." Existing examples include extraction of information about gene-gene interactions, alternative splicing, functional analysis of mutations, phosphorylation sites, and regulatory sites. The primary significance of OpenDMAP to these efforts is that it leverages the large-scale efforts being made in biomedical ontology development, such as the Open Biomedical Ontologies Foundry (OBO Foundry).
Logical representations of reality, such as those built on the OBO Foundry, use a set of predicates that formally describe properties of, or relationships among, objects. Predicates are defined with a specific number and type of admissible arguments. For example, the predicate expresses might be specified to take two arguments, a gene and a cell type, meaning that the specified gene is expressed in all normal cells of the specified type. Such predicates can also be related to each other through abstraction ("is a") and packaging ("part of") hierarchies, as done in the OBO Foundry. The semantics defined by the predicates and hierarchies in such ontologies provide a powerful tool for natural language processing.
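The idea of a predicate with a fixed number and type of admissible arguments, checked against an "is a" hierarchy, can be sketched in a few lines. This is a minimal illustration only: the `IS_A` table, the `Predicate` class and the `admits` method are hypothetical names invented for this sketch, and the toy hierarchy is not drawn from the OBO Foundry or from OpenDMAP itself.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical toy "is a" hierarchy (child -> parent), invented for illustration.
IS_A = {
    "hepatocyte": "cell type",
    "neuron": "cell type",
    "TP53": "gene",
    "BRCA1": "gene",
}


def is_a(concept: Optional[str], ancestor: str) -> bool:
    """Return True if `concept` equals `ancestor` or lies below it in IS_A."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)  # None once the top of the hierarchy is reached
    return False


@dataclass(frozen=True)
class Predicate:
    """A predicate with a fixed number and type of admissible arguments."""
    name: str
    arg_types: Tuple[str, ...]

    def admits(self, *args: str) -> bool:
        # Both the arity and the type of every argument must match.
        return len(args) == len(self.arg_types) and all(
            is_a(arg, required) for arg, required in zip(args, self.arg_types)
        )


# The two-argument predicate from the text: a gene and a cell type.
EXPRESSES = Predicate("expresses", ("gene", "cell type"))
```

Under these assumptions, `EXPRESSES.admits("TP53", "hepatocyte")` is accepted, whereas swapping the arguments or omitting one is rejected. This kind of type checking is what lets an ontology constrain which text spans can fill which role.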
Independently constructed ontologies have played at best a modest role in prior natural language processing systems. Guarino characterizes various uses of ontologies in information systems: only systems that use an ontology at run time (rather than during system construction) to explicitly represent the domain knowledge exploited by the system qualified for what Guarino called an "ontology-driven information system proper." To our knowledge, OpenDMAP is the first system developed to exploit a community consensus ontology as the central organizing principle of an information extraction system; for example, none of the systems that participated in the 2004 TREC Genomics evaluation for recognizing instances of Gene Ontology terms in text meets the Guarino definition. Other language processing systems have used either small, ad hoc conceptual representations developed specifically for the application, or structured linguistic resources, such as WordNet, which do not meet the logical requirements for an ontology. While the implementation reported below exploits only a small portion of the OBO Foundry, and the crucial Relationship Ontology component of the Foundry is still in an early stage of development, the organizing principles of OpenDMAP generalize straightforwardly.
The MetaMap system identifies biomedical concepts from free-form textual inputs and maps them to entries in the Unified Medical Language System (UMLS) metathesaurus; SemRep is a related system that maps to predications drawn from the UMLS semantic network, and SemGen [14, 15] is another related system that is focused on mapping to UMLS terms relevant to the etiology of genetic disease. These systems and their extensions have been used to extract semantic relationships relevant to pharmacogenomics and to compare alternative sources of information, among other applications. OpenDMAP is like MetaMap and its descendants in that it can only produce output drawn from a predefined semantic representation. The main difference is that MetaMap, SemRep and SemGen are structured as traditional NLP systems, with a lexicon that enumerates possible concepts that might be associated with a word or phrase. Multiple possible mappings are returned, with rankings. OpenDMAP provides an alternative method of organizing knowledge about language, so that each concept has associated with it a set of patterns that describe how that concept can be realized in language; there is no explicit lexicon.
To appreciate the differences between OpenDMAP and previous work in biomedical text mining, it is also useful to contrast its handling of syntactic structure and of semantic content with other systems. At one end of the spectrum are systems that employ essentially asyntactic representations. Early in the modern period of genomic natural language processing, some such systems were able to achieve significant (and in some cases ground-breaking) results using techniques based on text literals only. These include [18–20]. One line of subsequent work has attempted to increase the coverage of these early systems, which utilized manually built patterns, by automatically acquiring considerably larger sets of patterns; see, for example, Huang et al. 2004. Another line of subsequent work has focused on adding a modest, but still useful, level of linguistic abstraction by explicitly including either lexical categories (parts of speech), word stems, or both [22, 23]. These systems were essentially agrammatical; in contrast, OpenDMAP utilizes a classic form of "semantic grammar," freely mixing text literals, semantically typed basal syntactic constituents, and semantically defined classes of entities.
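To make the contrast with lexicon-driven systems concrete, the following minimal sketch attaches patterns to a concept and lets each pattern mix text literals with slots restricted to a semantic class. It is an illustration under stated assumptions, not OpenDMAP's pattern language: the pattern notation, the `PATTERNS` and `ENTITY_CLASS` tables, and the `match` and `analyze` functions are all invented here, `ENTITY_CLASS` stands in for an upstream named entity recognizer, and semantically typed syntactic constituents are omitted for brevity.

```python
from typing import Dict, List, Optional, Tuple, Union

# A pattern element is a text literal (str) or a slot: (slot name, semantic class).
Slot = Tuple[str, str]
Pattern = List[Union[str, Slot]]

# Hypothetical patterns keyed by concept; there is no word-to-concept lexicon.
PATTERNS: Dict[str, List[Pattern]] = {
    "interaction": [
        [("interactor1", "protein"), "interacts", "with", ("interactor2", "protein")],
        [("interactor1", "protein"), "binds", ("interactor2", "protein")],
    ],
}

# Stand-in for a named entity recognizer: token -> semantic class (assumed data).
ENTITY_CLASS = {"RAD51": "protein", "BRCA2": "protein"}


def match(pattern: Pattern, tokens: List[str]) -> Optional[Dict[str, str]]:
    """Return the slot fillers if the pattern matches all tokens, else None."""
    if len(pattern) != len(tokens):
        return None
    fillers: Dict[str, str] = {}
    for element, token in zip(pattern, tokens):
        if isinstance(element, str):
            # Text literal: must match the token, ignoring case.
            if element != token.lower():
                return None
        else:
            # Slot: the token must belong to the required semantic class.
            slot, required_class = element
            if ENTITY_CLASS.get(token) != required_class:
                return None
            fillers[slot] = token
    return fillers


def analyze(sentence: str) -> List[Tuple[str, Dict[str, str]]]:
    """Try every pattern of every concept against a whitespace-tokenized sentence."""
    tokens = sentence.split()
    results = []
    for concept, patterns in PATTERNS.items():
        for pattern in patterns:
            fillers = match(pattern, tokens)
            if fillers is not None:
                results.append((concept, fillers))
    return results
```

With these toy tables, `analyze("BRCA2 interacts with RAD51")` yields one `interaction` instance with both slots filled, while `analyze("BRCA2 binds water")` yields nothing because `water` is not in the required semantic class. The output is always an instance of a predefined concept, never a free-form string.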
Although OpenDMAP is capable of utilizing full syntactic parses, the patterns for the three separate tasks discussed in this paper utilize primarily shallow syntactic parses (the development phase of the transport project reports results using syntactic dependency information). It remains to be seen what depth of syntactic parsing is useful in biomedical text mining. Some early systems explored full parsing [24, 25], but they were not generally fruitful, and typical systems have employed at most shallow parsing [26–28]; only recently has productive attention returned to syntactically ambitious approaches to biomedical text [29–31], much of it taking a dependency-based, rather than a constituent-based, approach.
All of the systems discussed thus far have in common the fact that they employ some notion of explicit patterns, be they agrammatical, syntactic, or semantic. In a separate line of work, patterns are entirely implicit; that is, they exist only to the extent that they are captured by orthogonal features. This work approaches relation extraction as a classification problem; a classic example is the work of Craven and Kumlien 1999. Bunescu et al. 2005 present a detailed analysis of a number of classification-based approaches; the state of the art is characterized by the participants in the recent BioCreative protein-protein interaction shared task.
OpenDMAP has been applied in three domains: protein transport, protein-protein interaction and the expression of a gene in a particular cell type. The three application domains are independently significant. Protein transport, the directed movement of proteins from one cellular compartment to another, is a broadly important biological phenomenon. Although protein subcellular localization information is centralized (e.g. through ontological annotations at NCBI and in various model organism databases), information about transport is not. Protein transport information is published throughout the scientific literature, but no previous method was able to capture it systematically. Protein-protein interaction extraction has been the subject of dozens of systems (see, e.g. a review in ). Widely used web resources such as IHOP and Chilibot are based entirely on automated extraction of protein-protein interactions from text. This task was used in the BioCreative community evaluation, described below. The third application area, extraction of assertions that a particular gene is expressed in a particular cell type, is of significance since it appears to be the predicate found most frequently in the biomedical literature; a form of the verb "express," usually its nominalization "expression," appears in nearly 20% of NCBI's GeneRIFs.
The protein transport task is illustrative of another distinguishing aspect of the OpenDMAP approach: it provides mechanisms for handling relationships involving more than two entities. Note that the protein transport predicate has at least three arguments: what protein is transported, from where, and to where (our model also includes a fourth argument: the transporting protein). Although some linguistic expressions of the concept may elide an argument, the predicate itself inherently describes a greater than binary relationship. Wattarujeekrit et al. and Cohen and Hunter present evidence that many important predicates in biomedicine require more than two arguments. However, most previous efforts at extracting relationships from biomedical text have addressed exclusively binary relationships. Geneways and RLIMS-P are the only other biomedical IE systems of which we are aware that extracted greater than binary relationships, and neither is ontology-driven.
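One way to picture a greater than binary predicate whose arguments may be elided in text is a frame with four optional slots. The sketch below is a hypothetical data structure written for this illustration; the class name `ProteinTransport`, its field names and the `elided` method are assumptions and do not reproduce OpenDMAP's internal representation.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ProteinTransport:
    """A four-argument transport frame; any slot may be unfilled (None)."""
    transported: Optional[str] = None   # what protein is transported
    source: Optional[str] = None        # from where
    destination: Optional[str] = None   # to where
    transporter: Optional[str] = None   # the transporting protein

    def elided(self) -> List[str]:
        """Names of the arguments that the text left unexpressed."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# A sentence such as "ProteinA is imported into the nucleus" (a made-up example)
# expresses only two of the four arguments.
partial = ProteinTransport(transported="ProteinA", destination="nucleus")
```

Here `partial.elided()` reports that `source` and `transporter` were not expressed. The frame itself still has four roles, which is the sense in which the predicate is inherently more than binary even when a given sentence mentions only two participants.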
Assessing the accuracy of an information extraction system is a very labor-intensive activity. In order to identify information that could have been extracted, but was not (a "false negative"), a person must go through a large volume of text to determine all of the relevant assertions. To estimate the reliability of these manually derived assertions, at least two people must complete that task so that inter-rater reliability can be assessed. Once such data is used for one evaluation and system developers have seen it, further use of the data will generate upwardly biased accuracy estimates as system developers fit their systems to it. For these reasons, large-scale community evaluations of information extraction systems are particularly important. The second Critical Assessment of Information Extraction in Biology (BioCreative) community evaluation [34, 42] included a test of systems designed to extract human protein-protein interaction information from the full texts of hundreds of journal articles, called the IPS task. Human curators from the IntAct database manually extracted interaction assertions from these articles using the same curatorial standards as for the database. The results produced by human experts were compared to the results submitted from 45 systems developed by laboratories around the world, providing the best current assessment of the accuracy of protein interaction information extraction systems. The performance of OpenDMAP on the protein interaction task was evaluated as part of this shared task. More limited evaluations of the accuracy in the other applications are also reported in the results section.
The accuracy of an information extraction system depends on the genre of texts on which it operates. This report demonstrates the application of OpenDMAP to full texts of scientific journal articles, to Medline abstracts, and to GeneRIFs (single sentences or sentence fragments that are selected by human curators for relevance to the function of a particular gene product). GeneRIFs are particularly attractive targets for information extraction, due to their roughly sentential length (identified by  as the optimum), breadth of coverage, manual preselection for relevance, and association with at least one normalized gene reference. Despite these attractive features, this is the first report of an information extraction system targeting them.