Constructing a semantic predication gold standard from the biomedical literature

Background Semantic relations increasingly underpin biomedical text mining and knowledge discovery applications. The success of such practical applications crucially depends on the quality of extracted relations, which can be assessed against a gold standard reference. Most such references in biomedical text mining focus on narrow subdomains and adopt different semantic representations, rendering them difficult to use for benchmarking independently developed relation extraction systems. In this article, we present a multi-phase gold standard annotation study, in which we annotated 500 sentences randomly selected from MEDLINE abstracts on a wide range of biomedical topics with 1371 semantic predications. The UMLS Metathesaurus served as the main source for conceptual information and the UMLS Semantic Network for relational information. We measured interannotator agreement and analyzed the annotations closely to identify some of the challenges in annotating biomedical text with relations based on an ontology or a terminology. Results We obtain fair to moderate interannotator agreement in the practice phase (0.378-0.475). With improved guidelines and additional semantic equivalence criteria, the agreement increases by 12% (0.415 to 0.536) in the main annotation phase. In addition, we find that agreement increases to 0.688 when the agreement calculation is limited to those predications that are based only on the explicitly provided UMLS concepts and relations. Conclusions While interannotator agreement in the practice phase confirms that conceptual annotation is a challenging task, the increasing agreement in the main annotation phase points out that an acceptable level of agreement can be achieved in multiple iterations, by setting stricter guidelines and establishing semantic equivalence criteria. Mapping text to ontological concepts emerges as the main challenge in conceptual annotation. 
Annotating predications involving biomolecular entities and processes is particularly challenging. While the resulting gold standard is mainly intended to serve as a test collection for our semantic interpreter, we believe that the lessons learned are applicable generally.


Background
Large-scale information extraction (IE) from scientific literature is increasingly used to support advanced knowledge management and discovery systems [1][2][3]. The utility of such systems depends on the quality of the extracted information. Manually annotated gold standard corpora are critical for evaluating the accuracy and usefulness of information extraction systems [4]. In the biomedical domain, various corpora annotated for semantic phenomena have been constructed in recent years; annotations range from named entities [4][5][6][7], to semantic relations, such as protein-protein interactions [8,9], protein/gene/RNA relationships [10], disease-treatment relations [11], clinical relations [12], biological events [13], and gene regulation events [14]. More recently, the notion of "silver standard" has also been introduced [15], referring to harmonization of automated system annotations, as a proxy to labor-intensive gold standard annotation.
The gold standard corpora have often focused on text drawn from a narrow subdomain, adopting a particular semantic representation, addressing a small set of semantic types and aiming to provide training and evaluation support for specialized IE systems. These corpora differ with respect to their level of granularity and whether there is an ontological basis to the entity and relationship types used. For example, one of the most popular corpora in recent years has been the GENIA event corpus [13], drawn from the scientific literature on transcription factors in human blood cells. It is based on the notion of biological events, uses a few dozen Gene Ontology (GO) [16] event types and has been the basis for recent biological event extraction systems as well as two BioNLP Shared Task competitions [17,18]. The generally narrow focus of such corpora and their specific representation formalisms render them largely unsuitable for evaluating IE systems using different formalisms or resources.
We have been developing a semantic interpreter, SemRep [19], which extracts content from biomedical text in the form of semantic predications. A semantic predication is a logical subject-predicate-logical object triple whose elements are drawn from the UMLS knowledge sources [20]; the subject and object pair corresponds to UMLS Metathesaurus concepts and the predicate to a relation type in an extended version of UMLS Semantic Network. While the UMLS Semantic Network has not been designed as an ontology in a strict sense, the extended version that SemRep uses [21] serves as an ontological resource: it defines a domain model consisting of concept types (semantic types), relation types (ontological predicates) and the relationships that can hold between concept types (ontological predications). Each semantic predication extracted by SemRep is an instantiation of an ontological predication. We refer to this extended version of the UMLS Semantic Network as the SemRep ontology henceforth.
(1) MRI revealed a lacunar infarction in the left internal capsule.
(2) C0024485: Magnetic Resonance Imaging (Diagnostic Procedure)-DIAGNOSES-C0333559: Infarction, Lacunar (Disease or Syndrome)
C0152341: Internal Capsule (Body Part, Organ, or Organ Component)-LOCATION_OF-C0333559: Infarction, Lacunar (Disease or Syndrome)
SemRep processing is supported by an underspecified syntactic analysis based on the UMLS SPECIALIST Lexicon [22] and the MedPost part-of-speech tagger [23]. MetaMap [24] is used to map simple noun phrases to UMLS Metathesaurus concepts. Entrez Gene [25] serves as a supplementary source to the UMLS Metathesaurus with respect to gene/protein terms, which are identified using ABGene [26], in addition to MetaMap. Indicator rules map syntactic phenomena, such as verbs, nominalizations, prepositions, and modifier-head structure in the simple noun phrase, to ontological predicates from the SemRep ontology. SemRep currently uses the 2006AA release of the UMLS knowledge sources, due to the prevalence of ambiguity in later releases.
The lack of a suitable, manually annotated gold standard corpus has so far precluded a formal evaluation of SemRep (for focused, task-based evaluations, see [27][28][29]); system improvements and modifications have been informally evaluated through error analysis. A formal evaluation requires conceptual annotation with respect to the UMLS; that is, text fragments need to be mapped to concepts and relations in the UMLS, which provides a formal representation of domain knowledge. Considering that the UMLS Metathesaurus, the basis for conceptual information, consists of 92 source vocabularies and more than 1.2 million concepts (in 2006AA release), it is clear that such conceptual annotation is an extremely challenging task.
Large-scale conceptual annotation is not generally attempted in the biomedical domain. In fact, apart from the recent CRAFT corpus [4] and the CLEF corpus [12], we are not aware of any such annotation work. The ontology-based semantic annotation of the CRAFT Corpus [4] concentrates on biomolecular entities and processes, including gene/gene products, chemicals, sequence types, molecular functions, and cellular components. Ninety-seven full-text articles were annotated with concepts from eight terminologies, six of them from the OBO library [30]. They report that relationship annotation between concepts is ongoing work. Intended for clinical IE research, the CLEF corpus [12] annotates information about clinical entities and the relations between them, along with temporal information. It is limited to clinical notes and reports of deceased patients who had a diagnosis of neoplasms. CLEF builds on a relevant subset of UMLS concepts and relations as domain knowledge. Some semantic types are conflated and relations (predicates) renamed in accordance with the goals of the project. For instance, CLEF semantic type Condition includes symptoms, complications, functions, diagnosis, and problems, conflating several UMLS semantic types. In contrast to these manual conceptual annotation efforts, Jimeno et al. [7] semi-automatically created a small corpus annotated with disease concepts from UMLS Metathesaurus. Their semi-automatic methodology involves domain expert assessment of disease concepts identified by MetaMap, a dictionary lookup method, and a statistical method.
In this article, we present an annotation study in which we annotated 500 sentences from MEDLINE abstracts with 1371 semantic predications. Our annotation follows the conceptual annotation paradigm and adopts the entire UMLS as the domain model. Furthermore, we do not limit ourselves to text from a specific subdomain or adopt specific terminologies, in contrast to more focused efforts such as CLEF and CRAFT. While our methodology bears some similarities to the CLEF methodology in terms of the domain model and the guidelines, we use fine-grained UMLS semantic types, rather than coarse semantic groups, allowing for more flexibility. With respect to our ongoing research, the resulting gold standard reference is mainly intended to (a) facilitate comparison between SemRep releases and (b) guide further system development by allowing annotators to add comments and notes in a standardized manner so that problem areas can be identified. Limitations of the gold standard reference include its relatively small size, its sentence-bound annotation, and its binary, UMLS Semantic Network semantic relation formalism, arguably not as rich a semantic representation as predicate-argument structures with semantic roles (PASBio [31], GENIA event corpus [13], GREC [14]). While small corpus size may present challenges for learning purposes, it serves well for our primary goal of evaluation. On the other hand, the semantic predication representation (triples) has been shown to be simple, intuitively accessible, and tractable for large-scale knowledge discovery applications [3]. Furthermore, this representation lends itself readily to the Semantic Web and linked data movements, which aim to encode knowledge in subject-predicate-object triples for large-scale automatic reasoning [32]. Considering the challenges posed by this task, we believe our corpus presents a good first step toward larger-scale conceptual/relational annotation. It can serve as a gold standard reference for evaluating UMLS-based relation extraction systems and the lessons learned can provide guidance for future efforts in this area.

Methods
We conducted the annotation study in three phases: a) practice annotation phase, b) main annotation phase and c) adjudication phase. Before explaining these phases in more detail, we briefly discuss two fundamental aspects of our study: annotators and interannotator agreement.

Annotators
Three annotators, all authors of this paper, were involved in the annotation process. The annotators have diverse backgrounds; annotator A is a linguist, B a computer scientist and C a physician/biomedical informatics researcher. All three annotators have natural language processing experience and are experts in the SemRep methodology as well as the UMLS knowledge sources.
Using domain experts as annotators is often considered a good strategy to ensure validity and reliability of annotation. However, the tendency of domain experts to rely on inference due to their background knowledge has also been noted [13]. When annotation is concerned with a relatively narrow biomedical subdomain, it is generally feasible to recruit domain experts as annotators. However, several aspects of our annotation study make it difficult to find such annotators. First, we do not focus on a narrow biomedical subdomain. The consequence is that we need to either recruit experts who are knowledgeable in almost all aspects of biomedicine, or to find tens of annotators who can annotate different sentences on topics of their expertise. Neither option seemed feasible within the scope of our annotation. Furthermore, the two non-physician annotators (A and B) work with biomedical text on a full-time basis, and they expressed comfort with annotation after the practice phase. Secondly, in our annotation study, UMLS expertise is perhaps as crucial as domain expertise, and all three annotators are intimately familiar with the UMLS knowledge sources.
Recruiting domain experts who are also familiar with the UMLS would clearly be even more challenging. We believe that our small team of annotators with their expertise in UMLS and our multi-phase annotation methodology allows us to strike a balance between annotation reliability and validity.

Interannotator Agreement
The common approach to calculating interannotator agreement on classification tasks is to use kappa ($\kappa$) statistic [33], defined as $\kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}$, where $\Pr(a)$ is the relative observed agreement between annotators, and $\Pr(e)$ is the chance agreement. Calculating $\kappa$ for semantic predication annotations is non-trivial, since this type of linguistic annotation is not a simple classification task. Semantic predication annotation task can be decomposed into several subtasks for which agreement can be measured independently: (a) finding relation indicators, (b) finding textual mentions of arguments, (c) mapping the relation indicators to ontological predicates, and (d) mapping the textual mentions of arguments to concepts. With the exception of subtask (c), the space of possible annotations is either not clearly defined or very large, making the calculation of $\Pr(e)$, and therefore $\kappa$, challenging. For example, consider subtask (a). One way to calculate $\kappa$ for this subtask is to impose constraints on what can be annotated as an indicator. For example, only verbs may be considered as indicators and the number of verbs in a sentence can be used to calculate chance agreement $\Pr(e)$. We did not impose such constraints since we aimed for breadth of coverage in our annotation study. This leads to an explosion of possible indicator annotations and to the case in which $\Pr(e)$ essentially approaches zero. In such cases, it has been shown that $\kappa$ approximates F-measure among pairs of annotators [34]. Based on these observations and in line with annotation studies that share similarities with ours [12,14], we adopted F-measure for interannotator agreement. Between two sets of annotations, we calculated it as the F-score of one set of annotations, when the second is taken as the gold standard.
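The pairwise calculation described above can be illustrated with a minimal sketch. It assumes that each predication has been reduced to a hashable (subject CUI, predicate, object CUI) triple; the function name and the disagreeing triple are hypothetical and are not taken from the scripts used in the study.

```python
def pairwise_f_measure(annotations_a, annotations_b):
    """F-score of annotation set A when set B is treated as the gold standard."""
    set_a, set_b = set(annotations_a), set(annotations_b)
    if not set_a or not set_b:
        return 0.0
    matched = len(set_a & set_b)
    precision = matched / len(set_a)  # share of A's predications found in B
    recall = matched / len(set_b)     # share of B's predications found in A
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Triples from example (2); the third triple is an invented disagreement.
shared = ("C0024485", "DIAGNOSES", "C0333559")
only_a = ("C0152341", "LOCATION_OF", "C0333559")
only_b = ("C0024485", "DIAGNOSES", "C0152341")  # hypothetical annotation

annotator_a = {shared, only_a}
annotator_b = {shared, only_b}
agreement = pairwise_f_measure(annotator_a, annotator_b)
```

Because the balanced F-score reduces to twice the number of matches divided by the combined size of the two sets, the value does not depend on which annotator is taken as the gold standard.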

Practice Phase
For the practice annotation phase, we randomly selected 50 sentences (average of 29.8 tokens per sentence) from 50 MEDLINE abstracts published in the last 10 years. These sentences were extracted from the database of Semantic MEDLINE [27], a Web application to manage the results of PubMed searches, and were manually checked by the first author to ensure correct sentence segmentation. All three annotators participated in the practice annotation phase, each annotating the same set of 50 sentences. Given the annotators' familiarity with SemRep, minimal guidance for annotation was provided at this phase and the annotators were asked to annotate semantic predications expressed in the sentences, the textual mentions of their arguments and the predicate and the indicator type (i.e., whether the predicate is a verb (VERB), preposition (PREP), nominalization (NOM), participle (PART), etc.). To lighten the burden of finding an appropriate UMLS Metathesaurus concept corresponding to a textual mention, UMLS Metathesaurus concepts were extracted from these sentences using MetaMap [24] and were provided to the annotators (an average of 9.86 concepts per sentence). The guidance at this phase consisted of the following items:
1. A list of core ontological predicates relevant to SemRep and their definitions from the UMLS. For ontological predicates that are not part of the official UMLS Semantic Network but are in the SemRep ontology, we used our own definitions. We provide these definitions in the Appendix.
3. A sample sentence annotation, provided in the Appendix and illustrated in Figure 1.
4. Basic instructions consisted of the following:
a) Annotation should be restricted to semantic predications involving the core ontological predicates that are provided. Other predicates, even though they may be legitimate, should be ignored. On the other hand, the annotators are not restricted to the ontological predications that exist in the SemRep ontology.
b) The UMLS concepts extracted by MetaMap are not necessarily the best possible mappings. In addition, MetaMap may be unable to find a mapping. When in doubt, the annotator should try to find a concept that better matches the text (a UMLS Metathesaurus concept or an Entrez Gene term) using the UMLS Terminology Services (UTS [35]) or Entrez Gene [36], keeping in mind that SemRep currently uses the 2006AA version of the UMLS knowledge sources.
c) The annotation should be text-bound. That is to say, domain knowledge and inference should play a minimal role in annotation and the annotator should be concerned with what is explicitly stated in the text. To ensure this, the annotator should explicitly indicate the textual mentions that provide the basis for the annotation (those indicating the subject, object and the predicate), as well as the indicator type for the annotation.
Providing only minimal guidance, we aimed to identify the major challenges in annotating predications and find ways to deal with them in the main annotation phase. This first phase was collaborative: the annotators were free to discuss sentences, concerns and difficulties and develop solutions.
We chose not to use a particular annotation tool, since none of the existing tools fully met our needs, especially with respect to access to terminological resources. We decided not to develop a study-specific, in-house annotation tool due to time constraints. Instead, the annotators were instructed to simply type semantic predications in a text document, along with the textual mentions that trigger the predication components. Then, based on the results of the first phase, annotators were provided scripts that recognized formatting, spelling errors, and inconsistencies in annotation. These scripts were used by annotators in subsequent phases and helped them resolve such errors.
The practice annotation phase was completed in three weeks. After this phase concluded, the first author analyzed each annotation set to identify annotation patterns. The analysis results were then discussed among the annotators and served to refine the guidelines for the second phase. We present these refinements in the Results section. We also computed a baseline interannotator agreement between pairs of annotators. To assess agreement with respect to textual mentions vs. conceptual information, we calculated interannotator agreement using two criteria:
a. strict equivalence criterion, where for two semantic predications to be considered equal, their subject-predicate-object triples must match exactly.
b. relaxed equivalence criterion, where the exact match of the textual mentions of the arguments and of the predicate that establishes their relationship are considered sufficient for equality. In other words, conceptual match is not required between predication elements.
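The two criteria can be viewed as two different comparison keys over the same annotation record. The sketch below is one possible reading, not the study's implementation: the record layout and function names are assumptions, the relaxed key is taken to be the pair of argument mentions plus the predicate, and the alternative concept identifier is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predication:
    subject_cui: str   # concept chosen for the subject
    subject_text: str  # textual mention of the subject
    predicate: str     # ontological predicate
    object_cui: str    # concept chosen for the object
    object_text: str   # textual mention of the object


def strict_key(p):
    """Strict criterion: the concept-level triple must match exactly."""
    return (p.subject_cui, p.predicate, p.object_cui)


def relaxed_key(p):
    """Relaxed criterion (assumed reading): mentions and predicate must match."""
    return (p.subject_text.lower(), p.predicate, p.object_text.lower())


def matches(a, b, key):
    return key(a) == key(b)


# Sentence (1): both annotators mark the same mentions, but the second one
# maps "MRI" to a different concept ("CUI_ALT" is a made-up placeholder).
ann_a = Predication("C0024485", "MRI", "DIAGNOSES", "C0333559", "lacunar infarction")
ann_b = Predication("CUI_ALT", "MRI", "DIAGNOSES", "C0333559", "lacunar infarction")

strict_match = matches(ann_a, ann_b, strict_key)    # concepts differ
relaxed_match = matches(ann_a, ann_b, relaxed_key)  # mentions agree
```

Comparing the agreement obtained under each key separates disagreements about which text spans participate in a relation from disagreements about concept mapping.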

Main Annotation Phase
In the main annotation phase, two annotators (A and B) annotated a set of 500 randomly selected sentences drawn from 308 MEDLINE abstracts from the past 10 years (average of 27.9 tokens per sentence). Similar to the practice phase, sentences were drawn from the Semantic MEDLINE database, were checked for integrity, and UMLS Metathesaurus concepts extracted by MetaMap were provided for reference (9.09 concepts per sentence). The annotators worked independently in this phase, which concluded in eight weeks.
Based on our observations from the practice phase, we extended the equivalence criteria underlying interannotator agreement in two ways:
a. Predication equivalence (PE): A pair of distinct semantic predications may be considered equivalent under specific conditions when one inverts the arguments of the other and the predicates correspond to certain types. For instance, a predication X-LOCATION_OF-Y may be considered equivalent to Y-PART_OF-X when the arguments (X or Y) correspond to biomolecular entities.
b. Gene/gene product correspondence (GP): A pair of concepts may be considered equivalent when one corresponds to a gene term and the other corresponds to its gene product. For instance, the concept "C0287531: DUSP1 protein, human" (Amino Acid, Peptide, or Protein, Enzyme) is considered equivalent to "C1333257: DUSP1 gene" (Gene or Genome).
We also assessed the effect on annotation of domain knowledge provided to the annotators. For this purpose, we distinguished between two types of domain knowledge, conceptual and relational, and measured interannotator agreement on a subset of predications based on the following availability criteria, illustrated in (3)-(5) below:
a. Availability of conceptual knowledge (CK): A predication fulfills this criterion if both the subject and object arguments were extracted by MetaMap and, thus, were provided to the annotators.
b. Availability of relational knowledge (RK): A predication fulfills this criterion if it is sanctioned by the SemRep ontology. In other words, it corresponds to an existing ontological predication.
c. Availability of conceptual and relational knowledge (CRK): A predication fulfills this criterion if it satisfies both (a) and (b), the previous two criteria.
For illustration, consider the sentence fragment in (3). Relevant concepts identified by MetaMap are given in (4), and an annotated predication in (5).
(3) ... UDP-Glc is required in the synthesis of proteoglycans. ...
While the predication in (5) is correct, it does not fulfill criterion (a) above because MetaMap fails to identify the subject "C0041988: Uridine Diphosphate Glucose (Biologically Active Substance)," an argument absent in (4). It also fails to fulfill criterion (b) because the corresponding ontological predication "Biologically Active Substance-PRODUCES-Amino Acid, Peptide, or Protein" is not licensed by the current SemRep ontology. Since neither criterion (a) nor (b) is met, it follows that the predication does not fulfill criterion (c), either. As a result, the predication in (5) is not included when interannotator agreement calculation is restricted by any of the available domain knowledge criteria above.
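The three availability criteria amount to two membership tests and their conjunction, as the following sketch suggests. The dictionary layout, the function name, the object identifier "CUI_PROTEOGLYCANS", and the two-entry ontology are all illustrative assumptions; the sketch also simplifies by giving each concept a single semantic type.

```python
def knowledge_criteria(predication, metamap_cuis, ontology):
    """Return which availability criteria (CK, RK, CRK) a predication fulfills."""
    # CK: both arguments were among the concepts MetaMap supplied.
    ck = (predication["subject_cui"] in metamap_cuis
          and predication["object_cui"] in metamap_cuis)
    # RK: the type-level triple is a licensed ontological predication.
    rk = (predication["subject_type"],
          predication["predicate"],
          predication["object_type"]) in ontology
    return {"CK": ck, "RK": rk, "CRK": ck and rk}


# Toy stand-in for the SemRep ontology (assumed subset, not the real resource).
toy_ontology = {
    ("Diagnostic Procedure", "DIAGNOSES", "Disease or Syndrome"),
    ("Body Part, Organ, or Organ Component", "LOCATION_OF", "Disease or Syndrome"),
}

# The predication discussed for fragment (3); the object CUI is a placeholder.
udp_glc = {
    "subject_cui": "C0041988",
    "subject_type": "Biologically Active Substance",
    "predicate": "PRODUCES",
    "object_cui": "CUI_PROTEOGLYCANS",
    "object_type": "Amino Acid, Peptide, or Protein",
}

# MetaMap supplied the object but not the subject, as in the discussion above.
result = knowledge_criteria(udp_glc, {"CUI_PROTEOGLYCANS"}, toy_ontology)
```

Restricting the agreement calculation to a criterion then means filtering each annotator's predications with the corresponding flag before comparing the two sets.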

Adjudication
Finally, the third annotator (C) examined the annotations provided by each annotator and adjudicated them to create the current gold standard. During this phase, annotator C was free to discuss the annotations with the actual annotator to understand his/her reasoning. This phase concluded in four weeks.

Results
In this section, we provide detailed information regarding different phases of the annotation, including the number of semantic predications annotated, their distribution by ontological predicate types, and indicator types. We report strict and relaxed interannotator agreement using various equivalence criteria and domain knowledge perspectives. For the main annotation phase, we measure interannotator agreement at the ontological predicate and ontological predication level, highlighting some of the annotation difficulties with respect to biomedical subdomains. Examining annotation differences in the practice phase, we refined our guidelines for the main annotation phase and extended interannotator agreement measures, which we also discuss in this section.

Practice Phase
In this first phase, 50 sentences were annotated, with an average of 2.68 semantic predications per sentence. We show the number of predications annotated by each annotator at this phase in Table 1, and the most frequently annotated ontological predicates with their annotation frequency in Table 2. Indicator types that signal predications are provided in Table 3, which shows that verbal predicates are considered at similar rates among annotators, while there are larger gaps in consideration of other types.
Interannotator agreement was fair to moderate in this phase, as shown in Table 4. The highest agreement occurred between annotators A and C (0.475), consistent with the largely similar patterns they exhibited in the types of ontological predicates they annotated, as shown in Table 2. Unsurprisingly, interannotator agreement increases, albeit at different rates, when equivalence of textual mentions (relaxed equivalence) is accepted as a basis for agreement in addition to conceptual equivalence (strict equivalence).

Refining the Guidelines
The results of the practice phase helped us identify several aspects of annotation that are challenging. These were discussed among the annotators and several additions and clarifications were made in the annotation guidelines. These included the following:

Selecting appropriate UMLS Metathesaurus concepts
As stated earlier, we provided the annotators UMLS Metathesaurus concepts identified by MetaMap for reference. After the first phase, our analysis revealed that annotation behavior was diverse with regard to these MetaMap-provided concepts. One annotator almost exclusively relied on them, while the other two more extensively utilized the UMLS to find more adequate concepts. In addition, one of the annotators considered Metathesaurus concepts from newer UMLS releases. In the light of discussions over these differences, annotators were instructed to use the following decision criteria in selecting an appropriate concept:
1. If the concept identified by MetaMap adequately describes the textual mention, use it.
2. A concept that is clearly more general than the one stated in the text (that is, a hypernymic (ISA) relation holds between them) cannot be used as a replacement.
3. If MetaMap does not associate any concept with the textual mention OR if the associated concept seems inadequate, try to find a UMLS 2006AA concept that is appropriate.
4. When there is no corresponding UMLS 2006AA concept, a concept from a newer UMLS release may be used, provided that it does not violate the decision criterion (2) above.

Finding Entrez Gene terms
SemRep uses Entrez Gene [25] as a supplementary vocabulary to UMLS Metathesaurus when it identifies gene/protein terms. The mapping to Entrez Gene terms is achieved via pattern matching in SemRep. When available, these symbols were also provided to annotators in the practice phase. Similar to their reactions to MetaMap-supplied UMLS concepts, annotators also relied on Entrez Gene terms to varying degrees. One annotator tried to find Entrez Gene terms for all gene/protein mentions, whether or not a corresponding UMLS concept was found, and disambiguate model organisms using the context surrounding the textual mention, while another completely ignored Entrez Gene terms.
We identified two issues regarding these terms: (a) when is it required to annotate a textual mention with an Entrez Gene term? (b) is disambiguation with respect to model organisms necessary? We resolved these issues by concluding that the UMLS is the primary knowledge source for SemRep, and a gene/protein mention needs to be annotated with an Entrez Gene term only if no corresponding UMLS concept is found. Further, we decided that Entrez Gene terms should be limited to those in the Homo sapiens taxon, since SemRep currently only considers this taxon, and context beyond individual sentences may need to be taken into account for determining the model organism.

Semantic Network Ontological Predicates
The practice phase revealed some confusion with respect to the difference between an ontological predicate and its textual mention. For instance, for the fragment "... an association of 5-HTTLPR with intensity dependence of auditory-evoked potentials...", two annotators annotated the semantic predication "C0170657: serotonin transporter (Biologically Active Substance)-ASSOCIATED_WITH-C0015215:Auditory Evoked Potentials (Organism or Tissue Function)." The ontological predicate ASSOCIATED_WITH is used in SemRep in a restricted sense, referring to only gene-disease association. Although the semantic type of the object in the predication is not a disorder, the annotators used this ontological predicate because they were influenced by the choice of textual mention "association." The difference was made explicit in the guidelines.
We further clarified the definitions of several ontological predicates, taking into account how they are conceptualized in the SemRep ontology. For instance, we noted that comparative predicates (COMPARED_WITH, HIGHER_THAN, LOWER_THAN, SAME_AS) are limited to substance and therapeutic procedure semantic types for the time being, while PROCESS_OF is limited to disorder subjects. We also distinguished INTERACTS_WITH/AFFECTS and INHIBITS/DISRUPTS predicate pairs more explicitly: INTERACTS_WITH and INHIBITS relations hold between substances, while AFFECTS and DISRUPTS take processes as objects.

Hypernymic Relations
There was disagreement over what constitutes a hypernymic (ISA) relation. Problematic annotations included "C0050940: Lansoprazole (Organic Chemical)-ISA-C0599473: Enantiomer (Chemical Viewed Structurally)" for the noun phrase "lansoprazole enantiomers" and "C1443775: Epidermal growth factor receptor inhibitor (Pharmacologic Substance)-ISA-C0450442: Agent (Chemical Viewed Functionally)" for the fragment "Two oral EGFR inhibitors ... are small-molecule agents ...." We concluded our discussion on this topic by distinguishing between taxonomic relations that pertain to structural aspects (as in the former predication above) and other kinds of taxonomic relations, including those pertaining to functional aspects (the latter). It was clarified in the guidelines that structural taxonomic relations are not hypernymic.

Extending Interannotator Agreement Calculation
As briefly mentioned earlier, based on analysis of the agreement results in the practice phase, we extended equivalence criteria for interannotator agreement calculation. We describe these extensions in more detail in this section.

Equivalence of Semantic Predications (PE)
We found that two distinct semantic predications derived from the same textual mentions may be too close in meaning to select one over the other. For instance, we identified cases where one annotator used a predication with the predicate LOCATION_OF, while the other preferred one with the predicate PART_OF and inverse arguments. For the textual fragment "... alleles in two cell lines", one annotator annotated "C0002085: Alleles (Gene or Genome)-PART_OF-C0007600: Cell Line (Cell)", while the other annotated "C0007600: Cell Line (Cell)-LOCATION_OF-C0002085: Alleles (Gene or Genome)." Such similarity in meaning is partly rooted in the UMLS Semantic Network. For instance, the ontological predication "Virus-LOCATION_OF-Biologically Active Substance" is a valid ontological predication, as is one with inversion of these arguments and PART_OF as predicate ("Biologically Active Substance-PART_OF-Virus"). In addition, some indicator expressions can be ambiguous with respect to their meaning; for instance, the preposition "in" may map to either predicate in this example. Examining instances of the LOCATION_OF/PART_OF variation in the practice phase, we concluded that at the biomolecular level, the difference in meaning is blurred to the extent that two semantic predications may be considered equivalent for interannotator agreement calculation. However, we also noted that there are clear exceptions. For example, consider the sentence fragment "... truncated Bid (tBid), which translocates to mitochondria ...". While the predication "C0026237: Mitochondria (Cell Component)-LOCATION_OF-C1144558: tBid Protein (Amino Acid, Peptide, or Protein)" seems acceptable for this fragment, its counterpart with the predicate PART_OF ("C1144558: tBid Protein (Amino Acid, Peptide, or Protein)-PART_OF-C0026237: Mitochondria (Cell Component)") does not.
To handle such exceptions, we manually judged the cases in which there was agreement due to this equivalence criterion for correctness as a post-processing step to interannotator agreement calculation and corrected the agreement score accordingly.
A situation similar to LOCATION_OF/PART_OF equivalence concerns PRODUCES/PART_OF predicates and the verbal indicator "express" and its derivations. A relation indicating gene expression events does not currently exist in the SemRep ontology, leading one annotator to use PART_OF consistently for such events, and the other to use PRODUCES. In the fragment "The expressions of c-Myc, Ki-67, MMp-2 ... in cancer tissues", one annotator chose "C1334508: MKI67 gene (Gene or Genome)-PART_OF-C0040300: Body tissue (Tissue)" and the other "C0040300: Body tissue (Tissue)-PRODUCES-C1334508: MKI67 gene (Gene or Genome)." We treat these cases as equivalent as well. Interannotator agreement using these two equivalence criteria is denoted as PE below.
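The PE criterion can be illustrated with a minimal Python sketch. The `Predication` type and the `pe_equivalent` function are hypothetical names introduced here for illustration and are not part of the annotation tooling used in the study. The sketch only flags candidate matches; the manual correctness check described above (needed for exceptions such as the tBid example) is not modeled.

```python
from typing import NamedTuple


class Predication(NamedTuple):
    """Hypothetical container: UMLS CUIs for the arguments plus a predicate."""
    subject: str
    predicate: str
    object: str


# Predicate pairs treated as equivalent when the arguments are inverted,
# following the two cases described in the text.
INVERSE_EQUIVALENTS = {
    frozenset({"PART_OF", "LOCATION_OF"}),
    frozenset({"PART_OF", "PRODUCES"}),
}


def pe_equivalent(p: Predication, q: Predication) -> bool:
    """Return True if p and q are identical or PE-equivalent candidates."""
    if p == q:
        return True
    arguments_inverted = p.subject == q.object and p.object == q.subject
    predicates_paired = frozenset({p.predicate, q.predicate}) in INVERSE_EQUIVALENTS
    return arguments_inverted and predicates_paired


# Examples taken from the text: alleles/cell line and MKI67 gene/body tissue.
alleles_part_of = Predication("C0002085", "PART_OF", "C0007600")
cell_line_location_of = Predication("C0007600", "LOCATION_OF", "C0002085")
mki67_part_of = Predication("C1334508", "PART_OF", "C0040300")
tissue_produces = Predication("C0040300", "PRODUCES", "C1334508")
```

Note that the tBid/mitochondria pair would also be flagged by this check, which is why a manual review of such matches remains necessary.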

Correspondence of Gene and Gene Products (GP)
In the molecular biology literature, gene names are often used to denote gene products (proteins), leading to term ambiguity. For instance, in the fragment "TNFR1/Fas engagement results in the cleavage of cytosolic Bid to ...", without knowing the context, it is difficult to determine whether TNFR1, Fas or Bid refer to genes or gene products. This ambiguity extends to the UMLS Metathesaurus as well. For instance, Bid maps to "C1332410: BID gene (Gene or Genome)" and "C0531588: BID protein (Amino Acid, Peptide, or Protein, Biologically Active Substance)." On the other hand, TNFR1 is mapped to "C0255808: tumor necrosis factor receptor 1A (Amino Acid, Peptide, or Protein, Receptor)" as well as "C1363984: TNFRSF1A gene (Gene or Genome)". We extended our interannotator agreement calculation to take this into account and considered it a match when one predication involves the concept name "X gene", while the other involves the name "X protein*", where * indicates wildcard characters. This correspondence criterion accommodates the former (the simple case of Bid), while the equivalence in the case of TNFR1 is ignored. Although the two concepts ("tumor necrosis factor receptor 1A" vs. "TNFRSF1A gene") corresponding to the textual mention ("TNFR1") are considered Related Terms in UMLS, it is difficult to establish their correspondence from their concept names alone. Correspondence of this type is denoted as GP below.
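The name-based pattern behind the GP criterion can be sketched as follows. This is an illustration only: the function name `gp_match`, the regular expressions, and the case-insensitive comparison are assumptions made here, not the implementation used in the study.

```python
import re

# "X gene" and "X protein*" patterns; the trailing ".*" plays the role of the
# wildcard mentioned in the text. Case-insensitive matching is an assumption.
GENE_RE = re.compile(r"^(?P<stem>.+) gene$", re.IGNORECASE)
PROTEIN_RE = re.compile(r"^(?P<stem>.+?) protein.*$", re.IGNORECASE)


def gp_match(name_a: str, name_b: str) -> bool:
    """Return True if one concept name is 'X gene' and the other 'X protein*'."""
    for gene_name, protein_name in ((name_a, name_b), (name_b, name_a)):
        gene = GENE_RE.match(gene_name)
        protein = PROTEIN_RE.match(protein_name)
        if gene and protein and gene["stem"].lower() == protein["stem"].lower():
            return True
    return False
```

With the concept names from the text, `gp_match("BID gene", "BID protein")` holds, while `gp_match("TNFRSF1A gene", "tumor necrosis factor receptor 1A")` does not, which mirrors the limitation noted above for TNFR1.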

Limiting Agreement Calculation by Available Domain Knowledge (CK/RK/CRK)
To assess the difficulty of annotation due to the open-ended, exploratory nature of our annotation, we also limited interannotator agreement calculation based on the availability of conceptual knowledge to the annotator (CK) as well as that of relational knowledge (RK) and both (CRK), as mentioned earlier. Our intuition in limiting calculation to a subset of annotated predications was that there would be more substantial agreement when the annotator chooses concepts and ontological predications from a predefined set.
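The three restrictions amount to filtering the annotated predications before agreement is computed. The sketch below shows one way this filtering could look; the `Annotation` type, the `limit` function, and the toy concept and relation sets are hypothetical and are not checked against the actual MetaMap output or the UMLS Semantic Network.

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class Annotation(NamedTuple):
    """Hypothetical record for one annotated semantic predication."""
    subject_cui: str
    subject_type: str
    predicate: str
    object_cui: str
    object_type: str


def limit(
    annotations: List[Annotation],
    concepts: Optional[Set[str]] = None,
    relations: Optional[Set[Tuple[str, str, str]]] = None,
) -> List[Annotation]:
    """Keep annotations whose concepts and/or ontological predication were provided.

    concepts only -> CK, relations only -> RK, both -> CRK.
    """
    kept = []
    for a in annotations:
        if concepts is not None and not {a.subject_cui, a.object_cui} <= concepts:
            continue
        triple = (a.subject_type, a.predicate, a.object_type)
        if relations is not None and triple not in relations:
            continue
        kept.append(a)
    return kept


# Toy data for illustration only.
a1 = Annotation("C0002085", "Gene or Genome", "PART_OF", "C0007600", "Cell")
a2 = Annotation("C0082341", "Receptor", "PART_OF", "C0007600", "Cell")
a3 = Annotation("C0007600", "Cell", "PRODUCES", "C0002085", "Gene or Genome")
provided_concepts = {"C0002085", "C0007600"}
provided_relations = {
    ("Gene or Genome", "PART_OF", "Cell"),
    ("Receptor", "PART_OF", "Cell"),
}

ck = limit([a1, a2, a3], concepts=provided_concepts)
rk = limit([a1, a2, a3], relations=provided_relations)
crk = limit([a1, a2, a3], concepts=provided_concepts, relations=provided_relations)
```

In this toy example, CK keeps `a1` and `a3`, RK keeps `a1` and `a2`, and CRK keeps only `a1`.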

Main Annotation Phase
Two annotators (A and B) were involved in the main annotation phase. The average number of semantic predication annotations per sentence was slightly lower than that in the practice phase (2.64 vs. 2.68, respectively); one annotator (A) was relatively consistent between the practice and main annotation phases, while the other (B) annotated significantly fewer in the main phase, as shown in Table 5.
The interannotator agreement results for the main annotation phase are presented in Table 6. With the improved guidelines provided to the annotators, the basic strict agreement between the annotators increased from 0.415 to 0.500, while the basic relaxed agreement increase was higher: from 0.428 to 0.535. Additionally, we computed interannotator agreement using the extended equivalence criteria (rows 2-4). Adding the predication equivalence criterion (PE) increased the agreement by approximately 3%, while gene/gene product correspondence criterion (GP) provided a small increase of 0.5% overall. Limiting the comparison to conceptual knowledge provided (CK-column 3), the agreement reaches substantial levels with an increase of approximately 12%. Relational domain knowledge alone (RK-column 4) had less effect on agreement (an increase of about 3.5%). With both conceptual and relational knowledge provided to annotators (CRK-column 5), the agreement rises more than 15% in comparison to the basic case. We only show relaxed agreement in the basic case, where the increase from the strict counterpart is about 3.5%. We note that this trend was observable in the cases of predication filtering, as well (columns 3-5). We consider strict agreement with base + PE + GP equivalence as the main agreement criterion in the rest of the paper (0.536).
When we consider the indicator types that signal predications, modifier-head constructions (MOD_HEAD) appear more frequently as indicators than in the practice phase, as shown in Table 7. In addition, due to a clarification regarding how the indicator type for hypernymic relations should be annotated, we observed an increase in the frequency of the indicator type SPEC, which essentially indicates that the hypernymic relation between the concepts should be licensed by the UMLS Metathesaurus hierarchy in addition to being indicated by a textual clue.
The ontological predicate distribution in the main annotation phase is presented in Table 8. Comparing the distribution to that in the practice phase (as shown in Table 2), one noticeable trend is a relative increase in the number of PROCESS_OF, PART_OF, TREATS and AFFECTS predicates and a decrease in INHIBITS and INTERACTS_WITH predicates, possibly due to clarifications over definitions of the predicates after the practice phase. AFFECTS and INHIBITS predicates are not shown in Table 2 and Table 8, respectively, since they occur less frequently in the respective phases. However, we note that the increase in the number of AFFECTS predicates from the practice phase to the main annotation phase is 3.5%, whereas the decrease in that of INHIBITS is 8%. It is also worth mentioning that the high frequency of INHIBITS in the practice phase seems artificial, since most of those annotations were the result of a single instance that involved complex coordination.
During this phase, we also computed interannotator agreement (using strict agreement with base + PE + GP criterion) at the ontological predicate level to assess whether some types are more difficult to annotate than others. Among the most frequent predicates, there is less agreement on predicates relating biomolecular entities or processes (INTERACTS_WITH and AFFECTS, for example). On the other hand, predicates concerning disorders, anatomical parts and population groups yield overall higher agreement. Further examining ontological predicates with highest and lowest interannotator agreement (Table 9), these findings are confirmed. Among ontological predicates annotated more than 10 times, the highest disagreement rates were associated with those involving biomolecular entities and processes (all but PRECEDES are molecular-level predicates). On the other hand, the highest agreement rate was for PROCESS_OF and PREVENTS, both of which are disease-related ontological predicates.
Similar topical trends were observable at the ontological predication level, as well (Table 10). Among those ontological predications annotated more than 10 times between annotators, the highest disagreement rates were found in those concerning biomolecular entities and processes, while disagreement was lowest in predications involving disorders and population groups. These results clearly establish biomolecular relations and disorder/population relations as two extremes in the spectrum in terms of ease of annotation.
Annotators were allowed to provide comments for their annotations, and we examined the comments corresponding to the predications on which there was disagreement. The most frequent comments in disagreement cases are given below. The number of disagreement instances and the corresponding interannotator agreement scores are given in parentheses. The interannotator agreement score for a specific type of comment was calculated by considering all the predications marked with that comment, regardless of whether there was disagreement or not.

Adjudication Phase
Table 9 Highest and lowest agreement rates by ontological predicates, annotated more than 10 times
Table 10 Highest (column 1-3) and lowest (column 4-6) agreement rates for ontological predications annotated more than 10 times (N > 10)
In the adjudication phase, annotator C (adjudicator) examined the annotations provided by A and B, resolving differences and determining the gold standard.
In addition to selecting an annotation from one set over one from the other, the adjudicator could also override annotations from both sets (generating new predications) or keep annotations from both sets (complementary predications). The final semantic predication count is 1371, an average of 2.74 predications per sentence. The maximum number of predications per sentence is 23. The frequency distributions by ontological predicate types and indicator types after reconciliation are presented in Table 11 and Table 12, respectively. While interannotator agreement at this stage is not meaningful since the annotation sets were available to the adjudicator, we measured it to assess whether the adjudicator clearly prefers one set of annotations to the other. The adjudicator agreed with annotator B at a much higher rate (0.835 vs. 0.658). This is likely because the adjudicator took B's annotations as the basis and only added an annotation from A when B's annotation was deemed incorrect.

Discussion
We conducted a relatively open-ended, multi-phase annotation study, in which we aimed to assess the feasibility of iteratively constructing a reasonable gold standard reference based on UMLS domain knowledge. The results presented in the previous section show that this is a feasible undertaking. On the other hand, they confirm that conceptual annotation is extremely challenging; the main difficulty is mapping textual mentions to ontological terms (entities, processes, functions), a time-consuming and labor-intensive task. This is evidenced by the fact that interannotator agreement increases to acceptable levels (0.659) when the comparison is limited to conceptual knowledge available to the annotators. We discuss some of the challenges in more depth below.

Mapping text to ontological concepts
The core notion in mapping textual mentions to ontological concepts is semantic equivalence, which entails that the ontological concept must express the meaning of the textual mention adequately, and should not correspond to a more general or more specific meaning than the textual mention. However, this equivalence is often very difficult to establish. In our study, the difficulty of establishing semantic equivalence is further compounded by the nature and size of the primary reference terminology we used for conceptual information, the UMLS Metathesaurus. In addition, we defined the task in an open-ended manner and allowed the annotators to use newer UMLS releases when the primary knowledge source (2006AA UMLS) is inadequate, which increased the complexity of the task and lowered the interannotator agreement.
To ease the burden for the annotator, we provided UMLS concepts identified by MetaMap as reference, corresponding to the domain knowledge available to SemRep. While this was useful to a large extent, it did not prevent disagreements regarding what constitutes an adequately expressive, semantically equivalent mapping. In some cases, the concepts identified by MetaMap may even be misleading, as we found out, since one annotator often considered a given concept adequate for the textual mention in question, while the other was dissatisfied with the mapping and identified a clearly better UMLS Metathesaurus concept. For example, MetaMap failed to completely map the noun phrase "the D3 receptor" to the existing UMLS concept "C0082341: dopamine D3 receptor" (Receptor) and identified the concept "C0597357: receptor" (Receptor) instead. One annotator considered this more general concept adequate, while the other identified the former as the more adequate mapping. A similar situation occurred with "minimally invasive surgery," where MetaMap identified "C0543467: Operative Surgical Procedures" (Therapeutic or Preventive Procedure), while, in fact, there exists a more suitable concept "C0282624: Surgical Procedures, Minimally Invasive" (Therapeutic or Preventive Procedure) for the mention. There were also cases where MetaMap failed to identify a concept corresponding to a textual mention, and it was difficult for the annotator to see that the textual mention may in fact have a UMLS counterpart. For instance, only one annotator found and used the concept "C1254042: Anatomical maturation (Physiologic Function)" to correspond to the head of the more specific noun phrase "lung maturation." Despite these difficulties, however, evidence from the annotators indicates that concepts identified by MetaMap were overall helpful to annotators and lightened the annotation load considerably.
Furthermore, interannotator agreement increased significantly (approximately 12%) when only MetaMap-supplied concepts were used, supporting our basic intuition that selecting a concept from a predefined list is less demanding than searching for a semantically equivalent concept from a sizable terminology and assessing its adequacy, an essentially subjective task.
In other gold standard annotation studies, such as the CRAFT corpus and the CLEF corpus, carefully selected, well-curated vocabularies or ontologies were used. For example, the CRAFT corpus made use of six OBO ontologies. However, finding the semantically equivalent concept for a textual mention is still found to be challenging. In our annotation, consistent with the SemRep methodology, we did not limit ourselves to specific UMLS vocabularies and aimed to be more general in our coverage. However, this generality has its drawbacks, as expected. While synonymous terms from different vocabularies are clustered together to form concepts in the UMLS Metathesaurus, there are still a significant number of distinct concepts that are very close in meaning. For example, for the noun phrase "mineral metabolism impairment," MetaMap identified two separate concepts: "C0678715: mineral metabolism" (Organism Function) and "C0011155: Deficiency" (Functional Concept). Neither annotator found these mappings adequate (it can be argued that the best mapping would be a combination of both); however, they found different concepts, similar in meaning, to correspond to the meaning of the mention. One identified "C0687148: Mineral deficiency" (Disease or Syndrome), while the other identified "C0154260: Disorder of mineral metabolism" (Disease or Syndrome). While these concepts are not exactly the same, the difference between their meanings is arguably very small. A similar situation holds between "C0403716: Calculus in renal pelvis" (Disease or Syndrome) and "C0022650: Kidney Calculi" (Pathologic Function) for the phrase "pelvic stones." (note that "Pathologic Function" as a semantic type for "Kidney Calculi" seems incorrect; however, we did not judge the correctness of UMLS semantic type assignments in this work.) The problem seems even more acute when it comes to gene and gene product concepts, as exemplified earlier. 
It seems necessary to take into account such overlap and similarity in meaning, both in computing interannotator agreement and in system evaluation. We attempted to address the semantic equivalence of concepts to some extent by devising an equivalence criterion involving gene and gene products; however, this criterion is limited to a single, clear pattern, and is unable to accommodate more complex cases of equivalence. One principled solution may be to allow annotators to annotate more than one concept for a given textual mention when a clear preference for a concept cannot be established and to give partial (or full) credit for any of the concepts. This would clearly increase annotation complexity, and it may be more feasible as a postprocessing step.
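The multi-concept proposal above could be realized along the following lines. This is a purely illustrative sketch of a possible scoring rule, not something implemented in the study; the function name `concept_credit` and the `credit` parameter are assumptions.

```python
from typing import Set


def concept_credit(acceptable_cuis: Set[str], chosen_cui: str, credit: float = 1.0) -> float:
    """Score a chosen concept against a set of acceptable concepts for one mention.

    `credit` is the (partial or full) credit awarded for any acceptable concept;
    the choice of value is left open in the text and is an assumption here.
    """
    if not 0.0 <= credit <= 1.0:
        raise ValueError("credit must be between 0 and 1")
    return credit if chosen_cui in acceptable_cuis else 0.0


# Both concepts chosen by the annotators for "mineral metabolism impairment".
acceptable = {"C0687148", "C0154260"}
```

For example, `concept_credit(acceptable, "C0154260", credit=0.5)` would award half credit to either of the two near-synonymous concepts, while a concept outside the set receives none.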

Annotation by domain
Our results also show that difficulty of annotation varies by the biomedical subdomain. Ontological predications relating to biomolecular entities and processes are most challenging to annotate, while those on the topic of population characteristics of disease seem to be the most straightforward. We explain the difficulty of annotating biomolecular relations by observing the following:
a. Molecular biology text is hardest to read and interpret for a non-expert and none of the annotators are experts in this subdomain. Furthermore, as Friedman et al. [37] have shown, biomolecular domain text constitutes a sublanguage, with very specific characteristics, such as complex and nested relations as well as more prevalent syntactic ambiguity. One interesting, syntactically ambiguous case involved the fragment "... IL-1beta-induced ROS formation, NF-kappaB activation, and MCP-1 secretion ...", where one annotator took the modifier "IL-1beta-induced" as modifying "NF-kappaB activation" and "MCP-1 secretion" as well as "ROS formation", and annotated the predications given in (6), while the other annotator took the modifier to modify "ROS formation" only, and did not annotate the predications in (6).
b. The coverage of the UMLS Semantic Network with respect to molecular biology is perhaps the least extensive. In fact, we have extended the UMLS Semantic Network to create the SemRep ontology in prior work [21,38] specifically to redress this gap.
c. Based on evidence from the annotation process, it seems more challenging for MetaMap to map textual mentions of biomolecular entities to UMLS Metathesaurus than to map those of entities from other subdomains. For example, the semantic type most frequently associated with concepts identified by annotators and not by MetaMap was Amino Acid, Peptide, or Protein.
In contrast to genomic concepts and relations, disorder and population group concepts and their relations are well covered in the UMLS and MetaMap has less difficulty in mapping text to these concepts. In addition, text concerning disease characteristics is overall easier to interpret for a non-expert.

Diachronic change in domain knowledge
Another complicating factor in conceptual annotation is that the ontologies and vocabularies the concepts are derived from may change over time. There are two alternatives to address this situation: (a) using a static snapshot of the knowledge source, or (b) re-annotating at each update of the knowledge source. Our methodology was similar to the first alternative: we adopted the 2006AA release of UMLS as the primary source for conceptual information, while also recognizing that newer releases of UMLS have wider coverage, especially with respect to new drug or gene/protein names, and therefore allowed the annotators to use these resources, if necessary. Since we enforced text-bounded annotations, we believe that it will be possible to update the gold standard semi-automatically at future updates of the knowledge sources, using MetaMap or a similar program.

Conclusions
We have presented the construction of a semantic predication gold standard from biomedical literature text using the conceptual annotation paradigm. Manual conceptual annotation is considered extremely challenging, and our results confirm this perception, while also confirming that reasonable interannotator agreement could be achieved iteratively, consistent with the findings of Bada et al. [4]. While the domain knowledge we used (UMLS) reflects the application-specific aspect of our annotation, we believe that our analysis and discussion provide important insights for future efforts in this area.
The resulting gold standard constitutes the first resource, to our knowledge, in the biomedical domain that incorporates conceptual annotation of semantic relations in a wide variety of subdomains. Two sets of annotations and the adjudicated gold standard are made publicly available [39] for research purposes. A UMLS license is required. The corpus size is relatively small and may be insufficient for training information extraction systems. However, we believe it can serve as a benchmark to evaluate independently developed systems based on UMLS knowledge sources. Our goal is to use it for this particular purpose, as well as to guide future system development.