A corpus for plant-chemical relationships in the biomedical domain

Background Plants are natural products that humans consume in various ways including food and medicine. They have a long empirical history of treating diseases with relatively few side effects. Based on these strengths, many studies have been performed to verify the effectiveness of plants in treating diseases. It is crucial to understand the chemicals contained in plants because these chemicals can regulate activities of proteins that are key factors in causing diseases. With the accumulation of a large volume of biomedical literature in various databases such as PubMed, it is possible to automatically extract relationships between plants and chemicals in a large-scale way if we apply a text mining approach. A cornerstone of achieving this task is a corpus of relationships between plants and chemicals. Results In this study, we first constructed a corpus for plant and chemical entities and for the relationships between them. The corpus contains 267 plant entities, 475 chemical entities, and 1,007 plant–chemical relationships (550 and 457 positive and negative relationships, respectively), which are drawn from 377 sentences in 245 PubMed abstracts. Inter-annotator agreement scores for the corpus among three annotators were measured. The simple percent agreement scores for entities and trigger words for the relationships were 99.6 and 94.8 %, respectively, and the overall kappa score for the classification of positive and negative relationships was 79.8 %. We also developed a rule-based model to automatically extract such plant–chemical relationships. When we evaluated the rule-based model using the corpus and randomly selected biomedical articles, overall F-scores of 68.0 and 61.8 % were achieved, respectively. Conclusion We expect that the corpus for plant–chemical relationships will be a useful resource for enhancing plant research. The corpus is available at http://combio.gist.ac.kr/plantchemicalcorpus.


Background
Plants are a type of natural product that includes trees, herbs, edible foods, among others [1]. They are known to be abundant sources of chemicals with potential therapeutic effects [2]. Furthermore, since natural compounds have been empirically proven to have relatively fewer side effects and unwanted reactions, plants have been widely used for thousands of years for the treatment of diverse diseases and their symptoms [3]. Because of these advantages of plants, many studies have been carried out assessing the effectiveness of plants against diseases [4][5][6], and the number of patents related to pharmaceutical natural *Correspondence: hyunjulee@gist.ac.kr 1 School of Information and Communications, Gwangju Institute of Science and Technology, Chemdangwagi-ro, Gwangju, Republic of Korea Full list of author information is available at the end of the article products including plants has continuously increased [7]. To further enhance such efforts, identification of active substances or chemical compounds in plants is important because many diseases can be relieved or treated by chemicals in plants that control the activities of proteins related to diseases [8,9].
In this respect, many researchers have tried to construct public databases containing plant-related information, especially plant-chemical relationships that represent which compounds are included in which plants. In general, such data were manually collected from books, published results, and empirically widely known facts. One of the most representative databases containing plantchemical relationships is the traditional Chinese medicine (TCM) database@Taiwan [10], which is a 3D small molecular structure database of TCM for virtual screening or molecular simulation. Although this database currently contains 32,364 compounds from 352 different herbs, animal products, and minerals, which were manually collected from medical texts and scientific publications, only a small fraction of plants used for medicinal purposes are included in the database. TCMID [11] is one of the largest TCM databases providing TCM-related data, including prescriptions, herbs, herbal ingredients, targets, drugs, diseases, and relationships between these entities. This database was assembled by applying text mining methods to biomedical articles and by integrating other public databases such as TCM-ID [12], TCM database@Taiwan [10], HIT [13], STITCH [14], OMIM [15], and DrugBank [16]. However, even in TCMID, only 8159 plants (referred to as "herb" in TCMID) are currently provided, which is relatively small compared to the more than 150,000 plants defined in the NCBI Taxonomy database [17].
With the continuous accumulation of biomedical articles, it is possible to extract such plant-chemical relationships from the literature if proper corpora and text mining (TM) models are available. Until now, few TM systems that extract information about chemicals contained in plants from biomedical articles have been developed [18]. Jensen et al. [18] proposed an integrated text mining system based on manually constructed corpora for analyzing associations between plants and health effects. They extracted plant-chemical and plant-disease relationships from biomedical articles using text mining models and then integrated these two relationships to infer chemicaldisease relationships. Although Jensen et al. [18] has presented a text mining model to extract plant-chemical relationship, a corpus used to text-mine the relationship is not available for public use. Generally, developing a corpus requires significant efforts because several annotators need to manually identify entity names, trigger words, and positive or negative relationships in the articles. Because most text mining methods require an annotated corpus to learn models for detecting entity mentions and for extracting relationships between entities, providing a corpus specific for each domain is important [19]. Therefore, a plant-chemical corpus with a proper format such as the BioC XML format [20], which is a common interchange format widely used in the BioNLP community, is important for constructing and evaluating future text mining systems that extract plant-chemical relationships from texts.
Our study aims to develop a corpus for plant-chemical relationships. Here we describe processes for constructing the corpus for two entities of plants and chemicals and their plant-chemical relationship. In addition, we construct and evaluate a rule-based model, which automatically extracts plant-chemical relationships from articles. In this work, "plant-chemical relationships" are classified into two types: (i) a positive relationship means that a plant contains a chemical, i.e., a chemical is derived from a plant or a chemical is a part of the molecular structure of a plant (e.g., actinidin has previously been reported as the major allergen in kiwifruit); (ii) a negative relationship means that there is no information specifying that a plant contains a chemical (e.g., both parthenolide and Feverfew extract showed a time-dependency in their action). The corpus currently consists of 267 plant entities, 475 chemical entities, and 1,007 plant-chemical relationships (550 and 457 positive and negative relationships, respectively), which are drawn from 377 sentences in 245 PubMed abstracts. Our corpus will be useful for developing new natural language processing (NLP) tools related to plant-chemical relationships.

Related works
Automatic extraction of semantic relationships between domain-specific entities from articles requires recognition of the entity names and syntactic analysis of texts. For this task, an annotated corpus is necessary. Thus, we review several corpora and text mining systems related to chemicals and plants.
The Linnaeus corpus [21] annotates entities related to species and organisms, including plants, from 100 full-text articles in the PMC Open Access document data set. It was built for tools to recognize and to normalize species names and uses a dictionary-based approach with the NCBI taxonomy data. The Species corpus [22] remedies the shortcomings of the Linnaeus corpus, which annotates entities at the full-text level, by annotating entities at the abstract level to increase variability of species names. They selected 100 abstracts from journals in the following eight categories: bacteriology, botany, entomology, medicine, mycology, protistology, virology, and zoology. The corpus currently contains a total of 800 abstracts with annotated information of species mentions. The corpus was used for the development and evaluation of their NER tool based on the dictionary provided by the NCBI taxonomy database [17] for detecting species names.
The CHEMDNER corpus [23] is the most comprehensive data for the development of named entity recognition (NER) systems in the chemical domain. It contains a total of 84,355 chemical entity names from 10,000 PubMed abstracts, which were manually annotated by expert chemistry curators. Each abstract was carefully selected based on document selection criteria to be representative of a wide range of chemistry-related fields. Each chemical entity names was assigned to one of the following seven different subtypes: abbreviation, family, formula, identifier, multiple, systematic, and trivial. The authors of the corpus also provided detailed guidelines for identifying entity names with a proper entity class. The CHEMDNER corpus and the proposed annotation guidelines, which can be expanded by users, are publicly available so that they can be used for researchers developing TM systems in the area of chemicals. However, the corpus does not contain any relationship information.
Li et al. [24] performed manual annotation of chemical and disease entities and relationships between them for the BioCreative V challenge of recognizing disease name entities and extracting chemical-induced disease relationship. The corpus in the BioC XML format currently contains 1,500 PubMed articles with 4,409 chemicals, 5,818 diseases, and 3,116 chemical-disease interactions, all manually annotated. They used some annotation tools including PubTator [25] and NER tools such as DNorm [26] and tmChem [27] to accelerate manual annotation. The strength of this corpus is that chemical-disease relationships were extracted from both within a sentence and across sentence boundaries. Along with [23,24], there are several chemical-or drug-related corpora such as Comparative Toxicogenomics Database corpus [28] and a corpus contained in the Pharmspresso database [29]. However, there are so far no publicly available corpora for plant-chemical relationships.
Jensen et al. [18] proposed an integrated text mining approach and chemoinformatics analysis to enhance understanding of how plant-based diets (fruits, vegetables, and plant-based beverages) affect human health and disease prevention. They accumulated 369,549 plantphytochemical edges from 23,137 compounds and 15,722 plants and 38,090 plant-disease associations from 7,178 plants and 1,613 human disease phenotypes using a Naive Bayes classifier. These relationships were extracted from 21 million PubMed abstracts. Using chemicaldisease relationships inferred from text-mined relationships, they applied chemoinformatics methods to analyze the molecular-level association of a plant-based diet to diseases. For the development and evaluation of the text mining model, an in-house corpus for each relationship type was also constructed. However, the corpus is not publicly available.

Sentence collection and preprocessing
This section describes selection of candidate sentences before constructing the corpus as shown in Fig. 1. For this task, we automatically extracted 13,408,621 PubMed abstracts using PubTator [25], which provides functions for users to download all PubMed abstracts. Then, the following preprocessing steps were performed on the collected PubMed abstracts.
1. For pre-annotating plant names in abstracts, a plant name dictionary was first constructed using public data from TCMID [11] and NCBI Taxonomy [17]. The dictionary contains 333,686 plant names in English, Chinese, and Latin and it is available to download at our corpus web site. After constructing the dictionary, LingPipe [30], a dictionary-based exact-matching NER tool, was applied to all collected abstracts to locate plant names. Chemical names were annotated using ChemSpot [31], which is a specialized tool for locating chemical names that covers trivial names, drugs, abbreviations, and molecular formulas in texts. For the annotation of chemical identifiers (IDs), three types of IDs were used: MeSH, CHEMBL, and CAS Registry Number. 2. Using pre-annotated abstracts from step one, we collected 540,384 co-occurrence sentences in which at least one plant name and at least one chemical name co-occur. The rest of the sentences that do not contain either a plant name or a chemical name were excluded. 3. In this step, the main annotator selected, from 540,384 co-occurrence sentences, candidate sentences to be manually annotated by annotators in the "Annotating the corpus" step. Of the co-occurrence sentences, the number showing a negative relationship was larger than the number showing a positive relationship. Because positive relationships are more informative than negative relationships for showing that a plant contains a chemical, we constructed balanced numbers of positive and negative relationships. Hence, the main annotator randomly selected candidate sentences from the co-occurrence sentences and manually classified each sentence into positive or negative classes, where the numbers of positive and negative sentences were set to be approximately the same. In addition, the main annotator validated whether all the entity mentions and their IDs in sentences were correctly annotated by NER tools. If contents such as mentions and IDs were incorrectly annotated, then the main annotator manually corrected them. When multiple pairs of plant and chemical names were found in a candidate sentence, each plant-chemical pair was classified into a positive or negative relationship; we call each pair "a corpus unit," because more than one relationship can be found in a candidate sentence. For example, in the third line in Fig. 2, the candidate sentence has one chemical name and two plant names, which can produce two plant-chemical pairs (nitrogen-masson pine and nitrogen-Pinus massoniana). In this case, two different candidate corpus units were created from a single sentence (e.g., the third and fourth line in Fig. 2). Likewise, we split all candidate sentences into candidate corpus units.
In contrast with other existing corpora [23,24], which annotate relationships at the abstract level, we collected  Fig. 1 The workflow of corpus construction. The corpus was constructed as follows: (i) we collected PubMed abstracts from the PubTator database; (ii) we applied NER tools, including LingPipe and ChemSpot, to pre-annotate plant and chemical names; (iii) we extracted co-occurrence sentences that contain at least one plant and chemical name; (iv) we randomly selected candidate sentences, where the numbers of positive and negative sentences were set to be approximately the same, and also split them into corpus units; (v) we manually annotated candidate corpus units with our guidelines and also conducted later annotation to harmonize disagreements after annotators finished their annotation tasks; and (vi) we converted annotated corpus units to the BioC XML format the relationships at the sentence level using the above steps because we aimed to expand the diversity of plant names. In many plant-related abstracts or documents, the same plant terms appear multiple times, which may affect the quality of the corpus. The candidate corpus units from candidate sentences selected in the sentence collection and preprocessing step were exported to an Excel format as described in Fig. 2 and then annotated by annotators according to the annotation guidelines described in the next subsection.

Annotation guidelines
Annotation guidelines are defined to support validating NER results extracted by NER tools for entity names and to determine class labels of plant-chemical relationships during the annotation process. The guidelines were continuously updated by the main annotator during the sentence collection and preprocessing step. Because annotators need to check both NER results and relationships between two entities, the guidelines were categorized for the entity annotation and the relationship annotation.

Guidelines for entity annotation
Entity names in the corpus were first annotated by NER tools. However, due to inaccuracy of the NER tools, a fraction of annotated names might not be plants or chemicals, while actual plants or chemicals might be missed. Although these mistakes were initially checked by the main annotator, the other annotators also examined entities of plants and chemicals based on the guidelines.
The general guidelines for both entities are as follows: For plant names, the guidelines are as follows: • Taxonomy and TCMID IDs are allowed.
• Plant names whose actual contextual meaning do not represent plants should not be annotated (e.g., cinnamon rat). • Plant names include specific plants and a family of plants.
• Unnecessary adjectives and nouns next to plant names should not be annotated unless they are included in the plant dictionary (e.g., mashed potato, tobacco yield).
For chemical names, the guideline is as follows: • MeSH, CHEMBL, and CAS IDs are allowed.
• If multiple chemicals are linked together, annotators should consider them as a single name (e.g., linoleic and linolenic acids). • Receptors, transporters, genes, and proteins should not be annotated as a chemical name (e.g., chlorophyll-protein, tricarboxylate transporter, acetylcholine receptor).

Guidelines for relationship annotation
Annotators need to annotate positive and negative relationships between plants and chemicals, which are defined as follows: • Positive relationship: indicates that a plant contains a chemical; a chemical is derived from a plant, or the chemical is part of the molecular structure of the plant (e.g., Bilobalide (BB) is a sesquiterpenoid extracted from Ginkgo biloba leaves). In the example, we find a positive relationship between Ginkgo biloba (plant) and Bilobalide (chemical). • Negative relationship: indicates that a corpus unit does not specify that a plant contains a chemical (e.g., The fenamiphos treatment outperformed all fosthiazate treatments in tobacco yield and root gall reduction).
The guidelines for the relationship annotation are as follows: • Annotators classify a sentence containing a plant and a chemical into a positive or negative relationship. • When a chemical name contextually indicates one of the extraction solvents for plant extracts, annotators should classify this case as "negative relationship." • Metabolism is a process in a set of chemical reactions that modifies a chemical molecule into another molecule for storage or for immediate use in another reaction. According to the definition of metabolism, if a sentence represents that more than one chemical is involved in the metabolism process of a plant, annotators should regard all of them as ingredients of the plant.
-"   • Annotators should consider grammatical structures for deciding relationship types. In the example sentence below, there are two chemical names (anthocyanin) and three plant names (tomato, blackberries, and blueberries). As positive relationships, the former anthocyanin belongs to tomato and the latter anthocyanin belongs to blackberries and blueberries.
-"[Anthocyanin] chemical accumulation in [tomato]  • Annotators should annotate both a weak trigger term and a strong trigger term when a corpus unit has a positive relationship. The weak trigger can be more than one term representing a plant-chemical relationship. On the other hand, the strong trigger should be a single word that is thought to be the most representative word explaining a plant-chemical relationship. In the example of the sentence "The [calcium] chemical contents were highest in the [papaya] plant ," "were highest in" and "in" are the weak trigger term and the strong trigger term, respectively.

Annotating the corpus
This subsection describes the annotation process for plants, chemicals, and their relationships in all the candidate corpus units as shown in Fig. 1. The annotation task for the corpus was performed by three annotators with a basic knowledge of biology and traditional Chinese medicine. The main annotator performed sentence collection and preprocessing, built the annotation guidelines, and carried out the first annotation of the corpus. Two assistant annotators participated in the corpus annotation.
The corpus annotation work went through two phases (phases 1 and 2) because we divided the three annotators into two groups (Group 1: main annotator-assistant annotator 1, Group 2: main annotator-assistant annotator 2). Based on the guidelines, annotators in Group 1 and Group 2 manually annotated the different sets of candidate corpus units. A corpus constructed through the two phases is the main corpus in our work and is called "the primary corpus" hereafter to distinguish it from an additional corpus developed as described in the next section, Constructing a rule development corpus and a rule-based model. The annotation task was conducted by manually filling in fields of the Excel file as shown in Fig. 2, where "PlantName, " "P.ID, " and "P.Off " represent plant name, its identifier, and offset of the plant name in the text, respectively, and "ChemName, " "C.ID, " and "C.Off " represent chemical name, its identifier, and offset of the chemical name in the text. Details about other fields are also explained as follows: • P.Check: Write the letter "O" when all of the contents in "PlantName," "P.ID," and "P.Off" shown in Fig. 2 are correctly annotated and write the letter "X" when any of them are incorrectly annotated. • P.Note: Leave comments about incorrectly annotated contents in "PlantName," "P.ID," and "P.Off." This field is optional. • C.Check: Write the letter "O" when all of the contents in "ChemName," "C.ID," and "C.Off" shown in Fig. 2 are correctly annotated and write the letter "X" when any of them are incorrectly annotated. • C.Note: Leave comments about incorrectly annotated contents in "ChemName," "C.ID," and "C.Off." This field is optional. • Label: Write "POS" or "NEG" according to the relationship type between a plant, colored in red, and a chemical, colored in green, in the "Sentence" column. "POS" indicates that a plant includes a chemical while "NEG" means that there is no positive relationship between them.
• Weak Trigger: Write trigger terms, which represent the positive relationship between a plant and a chemical, in as broad a range as possible. For example, the weak trigger in the first corpus unit in Fig. 2 can be "were the highest in," whereas annotators write "NA" when the "Label" column is denoted as "NEG." This field is required only when the sentence contains a positive relationship between plant and chemical. Leave this field empty when the "Label" is annotated as "NEG." • Strong Trigger: Unlike the weak trigger, the strong trigger should be the single word regarding by the annotators as the most meaningful word that represents the positive relationship between a plant and a chemical. For example, the strong trigger in the first corpus unit in Fig. 2 can be "in," whereas annotators write "NA" when the "Label" column is denoted as "NEG." This field is required only when the sentence contains a positive relationship between plant and chemical. Leave this field empty when the "Label" is annotated as "NEG." After all annotators completed the annotation task, the main annotator collected annotation results to classify agreements or disagreements.

Constructing a rule development corpus and a rule-based model
We developed a rule-based model to extract plantchemical relationships from articles. For this task, the main annotator additionally constructed "a rule development corpus, " which is not a duplicate of the primary corpus described in the Annotating the corpus subsection. Then, the performance of the rule-based model was tested against the primary corpus.
To build the rule development corpus, the main annotator performed the same procedures described in the Sentence collection and preprocessing section and followed the annotation guidelines. The rule development corpus was then used to generate key rules for the model. To infer general grammar rules, we analyzed dependency parse trees (Figs. 3), which were obtained by applying the Stanford Dependency Parser to corpus units in the rule development corpus.
For example, the tree in Fig. 3 shows the dependency structure of a passive sentence, "About 450 mg of FB1 was obtained from 800 g cultured corn. " The noun phrase, "About 450 mg of FB1, " is connected with "nsubjpass" type, which is a passive nominal subject of a passive clause. The noun phrase, "800 g cultured corn, " is connected with "pobj" type, which is an object of a preposition. Then, we can observe that there is a positive relationship between the noun phrase containing a plant name ("corn") and the noun phrase containing a chemical name ("FB1") through the trigger, "were obtained from. " Based on such observations, we define a verbal trigger rule when the trigger type is a transitive form, which is defined below. For new inputs that may have a similar dependency structure with Fig. 3, the rule-based model compares the dependency tree of the new inputs with the verbal trigger rule by checking the following: (i) if there is a noun phrase containing a chemical name, which is connected with "nsubjpass" type; (ii) if there is a noun phrase containing a plant name, which is connected with "pobj" type that is an object of a preposition; and (iii) if the root term "obtained, " which is linked with "nsubjpass" and "prep, " belongs to one of trigger words defined in our trigger dictionary.
Based on observing dependency parse trees of the rule development corpus like the example above, our rulebased model contains six types of rules: (i) a verbal trigger rule; (ii) a prepositional trigger rule; (iii) a relative trigger rule; (iv) an apposition trigger rule; (v) a copula trigger rule; and (vi) a compound noun trigger rule. To clarify the rules, we constructed rule specifications that explain the rules in detail (Table 1). Here, we review each of the rule specifications. Note that NP 0 means a noun phrase containing a plant name and NP 1 specifies a noun phrase in which a chemical name appears.
• A verbal trigger rule: As shown in Table 1, the specification for the verbal trigger has three types according to the trigger type. The first rule structure is defined as NP 0 V tr NP 1 , where the trigger type of V tr is a transitive active form such as "contain" and "include." The second rule structure consists of NP 1 V tr PP NP 0 . In this case, the trigger type is a transitive passive form such as "be contained" and "be extracted." Thus, a preposition denoted as "PP" is required to be located between V tr and NP 0 , which is the constraint. In the third rule structure, V tr should be an intransitive verb such as "consist." Therefore, any preposition such as "of" should be placed next to V tr in the rule structure, NP 0 V tr PP NP 1 . • A prepositional trigger rule: The rule structure has the following two cases: NP 0 PP tr NP 1 and NP 1 PP tr NP 0 . Both NP 0 and NP 1 are allowed to be located on the left and right side of the trigger. Any preposition can be placed in PP tr . • A relative trigger rule: This consists of two types of specifications according to the trigger type. The first rule structure is defined as NP 1 R tr PP NP 0 , where R tr is the past participle form such as "isolated" and "extracted," and any preposition must be located between R tr and NP 0 . The second rule structure is described as NP 0 R tr (PP ) NP 1 , where the trigger type R tr is a gerund form such as "containing" and "having." Note that, in this case, the preposition such

corn -(11)
[amod] [amod] 800g -(9) cultured -(10) as "of" should be located between R tr and NP 1 when the trigger word is the gerund form of the intransitive verb such as "consisting." • An apposition trigger rule: The rule structure has the following two cases: NP 0 AP tr NP 1 and NP 1 AP tr NP 0 . In this specification, triggers can be any token whose dependency type is apposition. For example, a comma between NP 0 and NP 1 can indicate that NP 0 is in apposition with NP 1 or vice versa. Furthermore, we simply set a constraint that a token distance between NP 0 and NP 1 is less than ten to avoid the case that does not represent a relationship between them. • A copula trigger rule: The rule structure has the following two cases: NP 0 C tr NP 1 and NP 1 C tr NP 0 . In this specification, a trigger "C tr " can be any verb regardless of tense if its dependency type is copula.
To reduce false positives, we also set a simple constraint that there must be a token distance between NP 0 and NP 1 of less than ten. • A compound noun trigger rule: The rule structure is NP 0 CN tr NP 1 , where the trigger CN tr is a single white space between NP 0 and NP 1 . For example, "Aloe emodin" indicates that emodin is one of the ingredients in aloe.
We manually created a list of trigger words for each of the six types of rules that express a relationship between plant and chemical ( Table 2). The list of trigger words is also available to download at the corpus web site.

Overall process for extracting plant-chemical relationships
In Fig. 4, we describe the overall process of the rulebased model when new texts are given to the system. When biomedical abstracts are input to the model, for example, NER tools including ChemSpot [31] and the LingPipe [30] dictionary chunker are applied to annotate plant and chemical names in the new texts, respectively. Then, annotated abstracts are split into sentences using the LingPipe sentence splitter. Next, the Stanford Dependency Parser is applied to each split sentence to obtain dependency parse trees (Fig. 3). The dependency parse trees are then used to check whether the structure of each of the dependency parse trees matches one of the rules defined in our model (Table 1) and also whether there is a trigger word that belongs to our list (Table 2). Finally, our system predicts a class label that represents the relationship type (positive or negative) between the identified plants and chemicals.

Results
By applying the steps in the "Sentence collection and pre processing" subsection, we accumulated a total of 1,043 candidate corpus units from 382 candidate sentences, which were obtained from 262 abstracts. The candidate corpus units were divided into phase 1, containing 642 corpus units, and phase 2, containing 401 corpus units, and then annotators in Group 1 and Group 2 annotated the phase 1 units and phase 2 units, respectively. As a result, 939 corpus units out of 1,043 candidate corpus units were agreed upon by the annotators. The remaining 104 corpus units were disagreed upon. The annotators The token distance between NP 0 and NP 1 should be within ten.

Whereas that in PD is [soybean oil][,] [a source of unsaturated fatty acids]. (PMID: 19932903
2 NP 1 AP tr NP 0 Apposition form (e.g. comma) The token distance between NP 0 and NP 1 should be within ten.

[Haematococcus pluvialis] [is] [one of the potent organisms for production of astaxanthin]. (PMID: 23605447)
2 NP 1 C tr NP 0 Be verb form The token distance between NP 0 and NP 1 should be within ten.
[ The rule-based model consists of six types of rules. The first column shows the specification name. Each specification has more than one rule structure shown in the third column. In the rule structure, NP 0 means the noun phrase containing a plant name, and NP 1 represents the noun phrase in which a chemical name appears. The component marked with "tr" represents a trigger word described in the fourth column. We also defined several constraints if necessary discussed these disagreement units in order to reach consensus about the annotation of entities, relations, and triggers. After the discussion, 1,007 corpus units consisting of 550 positive relationships and 457 negative relationships were finally obtained. The corpus includes 267 plant names and 475 chemical names. For easy access and use, we converted the corpus into BioC XML format (Fig. 5). The corpus provides information about the gold-standard sentences with annotated plant and chemical names along with their IDs, locations of entity names, weak/strong triggers only for positive relationships, and class labels indicating whether the sentence includes a positive or negative relationship.

Inter-annotator agreement (IAA)
To assess the accuracy of the corpus, we calculated the IAA scores using simple percent agreement and Cohen's kappa statistic. The simple percent agreement is calculated as follows: [ agreements/(agreements + disagreements)] ×100 %. Cohen's kappa is the most frequently used method for measuring the overall agreement between two annotators, and it is generally regarded as a more robust measurement than the simple agreement calculation. According to [32], kappa values within the range 61 to 80 % are considered as "substantial" agreement, and values between 81 and 99 % constitute "almost perfect" agreement. Tables 3 and 4 show overall IAA scores for annotations. In Table 3, using the simple percent agreement, the overall agreement scores were 99.6 and 94.8 % for the annotated entities and triggers, respectively. Although entity names for plants and chemicals were pre-annotated using NER tools and checked by the main annotator, annotators additionally checked whether pre-annotated entity Apposition trigger (AP tr ) Apposition form any token whose dependency type is "appos" Copula trigger (C tr ) Copula form any token whose dependency type is "cop" Compound noun trigger (C tr ) Compound noun form strings that are made up of plant and chemical names together (e.g. panax ginseng saponin) names, their IDs, and locations of entity names were correct in case of any remaining NER errors. When the annotators agreed on entity names, their IDs, and locations of entity names, it is considered as an agreement between annotators. We also checked whether annotators selected the same weak and strong triggers explaining the positive relationship for a given plant-chemical pair. In addition, we measured the degree of agreements for annotations of the relationship between plant and chemical that are positive or negative. When the 1,043 candidate corpus units from 382 candidate sentences were given to annotators, 939 out of 1043 annotated relationship labels were agreed upon by the annotators. Using the Cohen's kappa statistic, the overall IAA  Fig. 4 The overall method for extracting plant-chemical relationships from abstracts. To extract plant-chemical relationships from biomedical articles, our model annotates plant and chemical names in the articles using LingPipe and ChemSpot, respectively. Then, the system splits annotated abstracts into sentences using the LingPipe sentence splitter to apply the Stanford Dependency Parser, which provides dependency parse trees for each sentence. Finally, the rule-based model checks the grammar structure of the dependency parse trees and the trigger words in sentences to extract relationships score was 79.8 % for the annotated relationship labels, which can be considered "substantial" agreement according to [32]. Table 4 shows the IAA scores for the case where the same annotation indicates that annotators are in agreement on both entities and relationship labels for each corpus unit. The overall IAA agreement score for this case was 78.9 % using the Cohen's kappa statistic.

Disagreements
After the annotation tasks, 104 out of 1,043 candidate corpus units were disagreed upon among the annotators due to the following reasons: subjective interpretation of sentences, misunderstanding of the origin of ingredients, and disregarding guidelines. To harmonize opinions among annotators, annotators performed an additional annotation phase for the disagreed-upon corpus units and specified reasons for their decision. Then, the main annotator collected only the discrepancies that were still in disagreement even after the additional annotation phase, and annotators again discussed the discrepancies to reach consensus on the annotation. As a result, we reached agreement on 68 corpus units from 104 disagreements, and they were added to our plantchemical corpus dataset. Three types of major disagreement cases are introduced in the following. -For the corpus unit in Example 1, one annotator assigned a negative relationship due to misinterpreting the meaning of the sentence. Actually, the sentence includes a positive relationship between the plant "Cruciferae" and the chemical "Allyl isothiocyanate" because they are closely linked with the trigger "a constituent of." • Example 2.    -For the corpus unit in Example 3, one annotator annotated it as a positive relationship between the plant "Cimicifuga racemosa" and the chemical "isopropanol." In fact, as described in the guidelines, the chemical "isopropanol" is used in a common extraction method for deriving active ingredients from plants. In this case, annotations between annotators were different due to disregarding the guidelines.

Evaluating the rule-based model
We  (iv) if the trigger word is detected, the system recognizes that there is a positive relationship between plant and chemical for the corpus. Finally, we obtained predicted class labels (positive or negative) of 939 corpus units and compared them with the originally annotated class labels. As a result, as shown in Table 5, the rule-based model achieved an overall F-measure of 68 %. The F-score for phase 1 was relatively lower than that for phase 2 because corpus units in phase 1 included more diverse variations of grammar structures that were not defined in the rule-based model. Of 939 corpus units, we found that there were 117 false positives and 188 false negatives. False negative errors were mostly due to two reasons: (i) the absence of rules in the model and (ii) the absence of trigger words in the model. For instance, consider "tobacco-specific nitrosamines. " In this case, the rulebased model could not identify it as positive because our model did not contain a grammar structure that can represent "plant-specific chemical, " and also, the trigger word ("-specific") was not defined. In another example, "3-(methylthio)propanal (cooked potato), " a grammar structure that allows a plant-chemical relationship through brackets was not defined in our model. False positive errors were mainly attributed to misunderstanding of semantic meaning in the rule-based model. For example, consider the following phrase: "dichloromethane extract of Feverfew. " In this case, the model predicted it as positive by the prepositional trigger rule. However, dichloromethane is one of the methods for extracting active components from plants but not an ingredient of feverfew. For the next example: "ammonia treatment of rice straw, " the model syntactically predicted it as positive although the semantic meaning is that ammonia was used to store rice straw at high moisture content and to kill weed seeds, indicating that ammonia was not contained in rice straw.
We performed an additional evaluation of the rulebased model by applying it to 43 randomly selected PubMed abstracts. The abstracts contain a total of 59 cooccurrence sentences containing both plant and chemical names. Two annotators manually annotated 113 corpus . This new corpus contains 28 positive and 85 negative relationships. Because the new corpus was constructed at the abstract level, the ratio of positive and negative sentences is different from the primary corpus in our work, where the numbers of positive and negative relationships are similar. Although it contains more negative relationships, the performance of the rule-based model is similar to that of phase 1 in the primary corpus, showing the rule-based model can be used for a corpus at the abstract level as well as a corpus at the sentence level.

Discussion
In this study, we constructed a plant-chemical corpus that can facilitate the development of a relationship extraction system for collecting plant-chemical relationships from texts. Thus, we have introduced guidelines for annotating sentences that represent relationships between plants and chemicals and also described how we constructed the corpus. As a result, we have identified a total of 1,007 plant-chemical corpus units from 377 goldstandard sentences that were selected from 245 PubMed abstracts.
In recent NLP studies, machine learning approaches are more dominantly employed and frequently perform better than rule-based approaches for many NLP tasks [33]. However, the rule-based approach has some advantages; for instance, it is easy to incorporate domain knowledge and to fix the cause of errors although it requires extensive manual labor [34]. In addition, in the process of constructing general rules by analyzing sentences, domain knowledge can be accumulated while machine learning approaches usually work as a black box. In our work, with a relatively small number of 102 corpus units in the rule development corpus, we developed the rule-based model for extracting plant-chemical relationships. Its performance achieved an F-measure of 68 % on 939 corpus units. For comparison, we employed the Turku Event Extraction System (TEES) [35], which is a support vector machine based text mining system. Although TEES was originally developed for extracting relationships between genes and biological events, it can be modified for extracting any binary relationship. We trained TEES with the same 102 corpus units included in the rule development corpus. Because TEES requires a development set, the 68 corpus units that are agreed upon among annotators after resolving the disagreements, was used for the development set. We applied 939 corpus units in the primary corpus to the model. This resulted in an F-measure of 25.7 % on the corpus units, which is poor compared to the rule-based model. This is because the training set was small. Hence, we again trained and evaluated the TEES model using a ten-fold cross validation on a larger set of 1,109 corpus units, including the following: (i) 939 corpus units from the primary corpus; (ii) 68 corpus units that are agreed upon among annotators after harmonizing the disagreements; and (iii) 102 rule development corpus units. Note that 798, 200, and 111 corpus units were used as training, development, and test sets, respectively, for the performance measurement in the ten-fold cross validation. As a result, it achieved an F-measure of 77.7 %. In this respect, it is possible to apply our corpus data to various machine learning methods to achieve better performance.

Conclusions
Our future work is to develop and apply a text mining model to the prediction of plant-chemical relationship in all the abstracts in PubMed based on the corpus developed in this work. This is challenging work because we need to improve NER models and the relationship prediction model by combining the rule-based model and a machine learning approach. It also requires validating the final accuracy of plant-chemical relationships predicted from all abstracts. When this future work is successful, we expect that we can obtain a large number of plantchemical relationships. With the increasing importance of natural products, knowledge about active compounds in herbs has become significant. Numerous diseases can be treated by herb compounds, and various medicines, cosmetics, and other products have widely used herbal compounds as their main components. Thus, we believe that our research provides an important step related to herbs or natural products in text-mining studies.