A corpus for plant-chemical relationships in the biomedical domain
© The Author(s) 2016
Received: 19 January 2016
Accepted: 8 September 2016
Published: 20 September 2016
Plants are natural products that humans consume in various ways including food and medicine. They have a long empirical history of treating diseases with relatively few side effects. Based on these strengths, many studies have been performed to verify the effectiveness of plants in treating diseases. It is crucial to understand the chemicals contained in plants because these chemicals can regulate activities of proteins that are key factors in causing diseases. With the accumulation of a large volume of biomedical literature in various databases such as PubMed, it is possible to automatically extract relationships between plants and chemicals in a large-scale way if we apply a text mining approach. A cornerstone of achieving this task is a corpus of relationships between plants and chemicals.
In this study, we first constructed a corpus for plant and chemical entities and for the relationships between them. The corpus contains 267 plant entities, 475 chemical entities, and 1,007 plant–chemical relationships (550 and 457 positive and negative relationships, respectively), which are drawn from 377 sentences in 245 PubMed abstracts. Inter-annotator agreement scores for the corpus among three annotators were measured. The simple percent agreement scores for entities and trigger words for the relationships were 99.6 and 94.8 %, respectively, and the overall kappa score for the classification of positive and negative relationships was 79.8 %. We also developed a rule-based model to automatically extract such plant–chemical relationships. When we evaluated the rule-based model using the corpus and randomly selected biomedical articles, overall F-scores of 68.0 and 61.8 % were achieved, respectively.
We expect that the corpus for plant–chemical relationships will be a useful resource for enhancing plant research. The corpus is available at http://combio.gist.ac.kr/plantchemicalcorpus.
Plants are a type of natural product that includes trees, herbs, and edible foods, among others. They are known to be abundant sources of chemicals with potential therapeutic effects. Furthermore, since natural compounds have been empirically shown to have relatively few side effects and unwanted reactions, plants have been widely used for thousands of years for the treatment of diverse diseases and their symptoms. Because of these advantages, many studies have assessed the effectiveness of plants against diseases [4–6], and the number of patents related to pharmaceutical natural products, including plants, has continuously increased. To further enhance such efforts, identification of active substances or chemical compounds in plants is important because many diseases can be relieved or treated by chemicals in plants that control the activities of proteins related to diseases [8, 9].
In this respect, many researchers have tried to construct public databases containing plant-related information, especially plant–chemical relationships that represent which compounds are included in which plants. In general, such data were manually collected from books, published results, and widely known empirical facts. One of the most representative databases containing plant–chemical relationships is the traditional Chinese medicine (TCM) database@Taiwan, which is a 3D small molecular structure database of TCM for virtual screening or molecular simulation. Although this database currently contains 32,364 compounds from 352 different herbs, animal products, and minerals, which were manually collected from medical texts and scientific publications, only a small fraction of plants used for medicinal purposes are included in the database. TCMID is one of the largest TCM databases providing TCM-related data, including prescriptions, herbs, herbal ingredients, targets, drugs, diseases, and relationships between these entities. This database was assembled by applying text mining methods to biomedical articles and by integrating other public databases such as TCM-ID, TCM database@Taiwan, HIT, STITCH, OMIM, and DrugBank. However, even in TCMID, only 8,159 plants (referred to as “herb” in TCMID) are currently provided, which is relatively small compared to the more than 150,000 plants defined in the NCBI Taxonomy database.
With the continuous accumulation of biomedical articles, it is possible to extract such plant–chemical relationships from the literature if proper corpora and text mining (TM) models are available. Until now, few TM systems that extract information about chemicals contained in plants from biomedical articles have been developed. Jensen et al. proposed an integrated text mining system based on manually constructed corpora for analyzing associations between plants and health effects. They extracted plant–chemical and plant–disease relationships from biomedical articles using text mining models and then integrated these two relationships to infer chemical–disease relationships. Although Jensen et al. presented a text mining model to extract plant–chemical relationships, the corpus used to develop that model is not available for public use. Generally, developing a corpus requires significant effort because several annotators need to manually identify entity names, trigger words, and positive or negative relationships in the articles. Because most text mining methods require an annotated corpus to learn models for detecting entity mentions and for extracting relationships between entities, providing a corpus specific to each domain is important. Therefore, a plant–chemical corpus in a proper format such as the BioC XML format, which is a common interchange format widely used in the BioNLP community, is important for constructing and evaluating future text mining systems that extract plant–chemical relationships from texts.
Our study aims to develop a corpus for plant–chemical relationships. Here we describe processes for constructing the corpus for two entities of plants and chemicals and their plant–chemical relationship. In addition, we construct and evaluate a rule-based model, which automatically extracts plant–chemical relationships from articles. In this work, “plant–chemical relationships” are classified into two types: (i) a positive relationship means that a plant contains a chemical, i.e., a chemical is derived from a plant or a chemical is a part of the molecular structure of a plant (e.g., actinidin has previously been reported as the major allergen in kiwifruit); (ii) a negative relationship means that there is no information specifying that a plant contains a chemical (e.g., both parthenolide and Feverfew extract showed a time-dependency in their action). The corpus currently consists of 267 plant entities, 475 chemical entities, and 1,007 plant–chemical relationships (550 and 457 positive and negative relationships, respectively), which are drawn from 377 sentences in 245 PubMed abstracts. Our corpus will be useful for developing new natural language processing (NLP) tools related to plant–chemical relationships.
Automatic extraction of semantic relationships between domain-specific entities from articles requires recognition of the entity names and syntactic analysis of texts. For this task, an annotated corpus is necessary. Thus, we review several corpora and text mining systems related to chemicals and plants.
The Linnaeus corpus annotates entities related to species and organisms, including plants, in 100 full-text articles from the PMC Open Access document data set. It was built to support tools that recognize and normalize species names using a dictionary-based approach with the NCBI Taxonomy data. The Species corpus addresses a shortcoming of the Linnaeus corpus, which annotates entities at the full-text level, by annotating entities at the abstract level to increase the variability of species names. Its authors selected 100 abstracts from journals in each of the following eight categories: bacteriology, botany, entomology, medicine, mycology, protistology, virology, and zoology. The corpus currently contains a total of 800 abstracts annotated with species mentions. It was used for the development and evaluation of the authors' NER tool, which detects species names using a dictionary provided by the NCBI Taxonomy database.
The CHEMDNER corpus is the most comprehensive data set for the development of named entity recognition (NER) systems in the chemical domain. It contains a total of 84,355 chemical entity names from 10,000 PubMed abstracts, which were manually annotated by expert chemistry curators. Each abstract was carefully selected based on document selection criteria to be representative of a wide range of chemistry-related fields. Each chemical entity name was assigned to one of the following seven subtypes: abbreviation, family, formula, identifier, multiple, systematic, and trivial. The authors of the corpus also provided detailed guidelines for identifying entity names with a proper entity class. The CHEMDNER corpus and the proposed annotation guidelines, which can be expanded by users, are publicly available so that they can be used by researchers developing TM systems in the area of chemicals. However, the corpus does not contain any relationship information.
Li et al. performed manual annotation of chemical and disease entities and the relationships between them for the BioCreative V challenge of recognizing disease name entities and extracting chemical-induced disease relationships. The corpus, provided in the BioC XML format, currently contains 1,500 PubMed articles with 4,409 chemicals, 5,818 diseases, and 3,116 chemical–disease interactions, all manually annotated. They used annotation tools including PubTator and NER tools such as DNorm and tmChem to accelerate manual annotation. The strength of this corpus is that chemical–disease relationships were extracted both within a sentence and across sentence boundaries. Along with [23, 24], there are several chemical- or drug-related corpora such as the Comparative Toxicogenomics Database corpus and a corpus contained in the Pharmspresso database. However, there are so far no publicly available corpora for plant–chemical relationships.
Jensen et al.  proposed an integrated text mining approach and chemoinformatics analysis to enhance understanding of how plant-based diets (fruits, vegetables, and plant-based beverages) affect human health and disease prevention. They accumulated 369,549 plant–phytochemical edges from 23,137 compounds and 15,722 plants and 38,090 plant–disease associations from 7,178 plants and 1,613 human disease phenotypes using a Naive Bayes classifier. These relationships were extracted from 21 million PubMed abstracts. Using chemical–disease relationships inferred from text-mined relationships, they applied chemoinformatics methods to analyze the molecular-level association of a plant-based diet to diseases. For the development and evaluation of the text mining model, an in-house corpus for each relationship type was also constructed. However, the corpus is not publicly available.
Sentence collection and preprocessing
For pre-annotating plant names in abstracts, a plant name dictionary was first constructed using public data from TCMID and NCBI Taxonomy. The dictionary contains 333,686 plant names in English, Chinese, and Latin, and it is available for download at our corpus web site. After constructing the dictionary, LingPipe, a dictionary-based exact-matching NER tool, was applied to all collected abstracts to locate plant names. Chemical names were annotated using ChemSpot, which is a specialized tool for locating chemical names that covers trivial names, drugs, abbreviations, and molecular formulas in texts. For the annotation of chemical identifiers (IDs), three types of IDs were used: MeSH, CHEMBL, and CAS Registry Number.
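The dictionary-based pre-annotation of plant names can be illustrated with a minimal sketch. It does not use LingPipe itself; it stands in a regular-expression matcher for the exact-matching step, and the five-entry dictionary, the function names, and the case-insensitive matching are assumptions made only for this illustration.

```python
import re

# Hypothetical mini-dictionary; the dictionary described above holds 333,686 names.
PLANT_NAMES = ["Ginkgo biloba", "ginger", "papaya", "Pinus massoniana", "masson pine"]

def build_matcher(names):
    """Compile one pattern for all names, trying longer names first."""
    ordered = sorted(set(names), key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    # Case-insensitive matching is an assumption of this sketch.
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

def find_mentions(text, matcher):
    """Return (start offset, end offset, surface form) for each dictionary hit."""
    return [(m.start(), m.end(), m.group(0)) for m in matcher.finditer(text)]

matcher = build_matcher(PLANT_NAMES)
sentence = "Bilobalide (BB) is a sesquiterpenoid extracted from Ginkgo biloba leaves."
mentions = find_mentions(sentence, matcher)
for start, end, surface in mentions:
    print(start, end, surface)
```

The character offsets returned here play the role of the offset fields that annotators later verify for each plant mention.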
Using the pre-annotated abstracts from step one, we collected 540,384 co-occurrence sentences, in each of which at least one plant name and at least one chemical name co-occur. Sentences lacking either a plant name or a chemical name were excluded.
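The co-occurrence filter can be sketched as follows. The `Sentence` class and its field names are hypothetical, and the second and third sentences are invented solely to show the exclusion cases.

```python
from dataclasses import dataclass, field

@dataclass
class Sentence:
    """A pre-annotated sentence (hypothetical structure for this sketch)."""
    text: str
    plants: list = field(default_factory=list)
    chemicals: list = field(default_factory=list)

def is_cooccurrence(sentence):
    """Keep a sentence only if it has at least one plant and one chemical name."""
    return bool(sentence.plants) and bool(sentence.chemicals)

sentences = [
    Sentence("The calcium contents were the highest in the papaya.",
             plants=["papaya"], chemicals=["calcium"]),
    # The next two sentences are invented examples of excluded cases.
    Sentence("Papaya is widely cultivated.", plants=["Papaya"]),
    Sentence("Calcium was measured in all samples.", chemicals=["Calcium"]),
]

cooccurrence = [s for s in sentences if is_cooccurrence(s)]
print(len(cooccurrence))
```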
In this step, the main annotator selected, from 540,384 co-occurrence sentences, candidate sentences to be manually annotated by annotators in the “Annotating the corpus” step. Of the co-occurrence sentences, the number showing a negative relationship was larger than the number showing a positive relationship. Because positive relationships are more informative than negative relationships for showing that a plant contains a chemical, we constructed balanced numbers of positive and negative relationships. Hence, the main annotator randomly selected candidate sentences from the co-occurrence sentences and manually classified each sentence into positive or negative classes, where the numbers of positive and negative sentences were set to be approximately the same. In addition, the main annotator validated whether all the entity mentions and their IDs in sentences were correctly annotated by NER tools. If contents such as mentions and IDs were incorrectly annotated, then the main annotator manually corrected them. When multiple pairs of plant and chemical names were found in a candidate sentence, each plant–chemical pair was classified into a positive or negative relationship; we call each pair “a corpus unit,” because more than one relationship can be found in a candidate sentence. For example, in the third line in Fig. 2, the candidate sentence has one chemical name and two plant names, which can produce two plant–chemical pairs (nitrogen–masson pine and nitrogen–Pinus massoniana). In this case, two different candidate corpus units were created from a single sentence (e.g., the third and fourth line in Fig. 2). Likewise, we split all candidate sentences into candidate corpus units.
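Splitting a candidate sentence into candidate corpus units amounts to enumerating every plant–chemical pair found in it. The sketch below is one way to express this; the function name, the dictionary keys, and the placeholder sentence text are assumptions, since the actual sentence appears only in Fig. 2.

```python
from itertools import product

def to_corpus_units(sentence, plants, chemicals):
    """Create one candidate corpus unit per plant-chemical pair in a sentence."""
    return [
        {"sentence": sentence, "plant": plant, "chemical": chemical, "label": None}
        for plant, chemical in product(plants, chemicals)
    ]

# Placeholder text: the real sentence is the one shown in Fig. 2.
units = to_corpus_units(
    "<candidate sentence>",
    plants=["masson pine", "Pinus massoniana"],
    chemicals=["nitrogen"],
)
for unit in units:
    print(unit["chemical"], "-", unit["plant"])
```

Each unit starts with an empty label, which is filled in as positive or negative during annotation.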
In contrast with other existing corpora [23, 24], which annotate relationships at the abstract level, we collected the relationships at the sentence level using the above steps because we aimed to expand the diversity of plant names. In many plant-related abstracts or documents, the same plant terms appear multiple times, which may affect the quality of the corpus.
The candidate corpus units from candidate sentences selected in the sentence collection and preprocessing step were exported to an Excel format as described in Fig. 2 and then annotated by annotators according to the annotation guidelines described in the next subsection.
Annotation guidelines are defined to support validating NER results extracted by NER tools for entity names and to determine class labels of plant–chemical relationships during the annotation process. The guidelines were continuously updated by the main annotator during the sentence collection and preprocessing step. Because annotators need to check both NER results and relationships between two entities, the guidelines were categorized for the entity annotation and the relationship annotation.
Guidelines for entity annotation
Entity names in the corpus were first annotated by NER tools. However, due to inaccuracy of the NER tools, a fraction of annotated names might not be plants or chemicals, while actual plants or chemicals might be missed. Although these mistakes were initially checked by the main annotator, the other annotators also examined entities of plants and chemicals based on the guidelines.
Annotators only examine a plant and a chemical for each corpus unit, colored in green and red, respectively (Fig. 2).
Because entity names were pre-annotated using NER tools, annotators should check whether there are missed or incorrectly annotated names and IDs. IDs are annotated as “NA” when there are no proper IDs.
Acronyms should be annotated. When it is not clear whether an acronym indicates a plant or chemical name, an annotator annotates the acronym only if its original long form is found in the corresponding PubMed abstract.
Taxonomy and TCMID IDs are allowed.
Plant names whose actual contextual meaning does not represent plants should not be annotated (e.g., cinnamon rat).
Plant names include specific plants and a family of plants.
Unnecessary adjectives and nouns next to plant names should not be annotated unless they are included in the plant dictionary (e.g., mashed potato, tobacco yield).
MeSH, CHEMBL, and CAS IDs are allowed.
If multiple chemicals are linked together, annotators should consider them as a single name (e.g., linoleic and linolenic acids).
Receptors, transporters, genes, and proteins should not be annotated as a chemical name (e.g., chlorophyll-protein, tricarboxylate transporter, acetylcholine receptor).
Guidelines for relationship annotation
Positive relationship: indicates that a plant contains a chemical; a chemical is derived from a plant, or the chemical is part of the molecular structure of the plant (e.g., Bilobalide (BB) is a sesquiterpenoid extracted from Ginkgo biloba leaves). In the example, we find a positive relationship between Ginkgo biloba (plant) and Bilobalide (chemical).
Negative relationship: indicates that a corpus unit does not specify that a plant contains a chemical (e.g., The fenamiphos treatment outperformed all fosthiazate treatments in tobacco yield and root gall reduction).
Annotators classify a sentence containing a plant and a chemical into a positive or negative relationship.
When a chemical name contextually indicates one of the extraction solvents for plant extracts, annotators should classify this case as “negative relationship.”
“[ methanol/ethanol/petroleum/chloroform/isopropanol ] chemical extracts of [ ginger ] plant ”
Metabolism is a process in a set of chemical reactions that modifies a chemical molecule into another molecule for storage or for immediate use in another reaction. According to the definition of metabolism, if a sentence represents that more than one chemical is involved in the metabolism process of a plant, annotators should regard all of them as ingredients of the plant.
“[ 26-Norbrassinolide ] chemical , identified as a metabolite of brassinolide in cultured cells of the [ liverwort, Marchantia polymorpha ] plant , as well as 26-norcastasterone and 26-nor-6-deoxocastasterone were synthesized”
Synthesis is a process that produces an organic compound in living things. By this definition, if a sentence indicates that a chemical is synthesized in a plant, annotators should regard the chemical as an ingredient of the plant.
“Synthesis of [ O-acylhomoserine esters ] chemical was detected only in [ Pisum sativum L ] plant ”
If a sentence explains the positive relationship between a derived plant from an original plant and a chemical, annotators should consider that both the derived plant and the original plant contain the chemical.
“[ Snapdragon ] plant in tomato contains [ anthocyanin ] chemical ”
“Snapdragon in [ tomato ] plant contains [ anthocyanin ] chemical ”
Annotators should consider grammatical structures for deciding relationship types. In the example sentence below, there are two chemical names (anthocyanin) and three plant names (tomato, blackberries, and blueberries). As positive relationships, the former anthocyanin belongs to tomato and the latter anthocyanin belongs to blackberries and blueberries.
“[ Anthocyanin ] chemical accumulation in [ tomato ] plant and at concentrations comparable to the anthocyanin levels found in blackberries and blueberries”
“Anthocyanin accumulation in tomato and at concentrations comparable to the [ anthocyanin ] chemical levels found in [ blackberries ] plant and blueberries”
“Anthocyanin accumulation in tomato and at concentrations comparable to the [ anthocyanin ] chemical levels found in blackberries and [ blueberries ] plant ”
Annotators should annotate both a weak trigger term and a strong trigger term when a corpus unit has a positive relationship. The weak trigger can consist of more than one term representing a plant–chemical relationship. In contrast, the strong trigger should be the single word considered most representative of the plant–chemical relationship. In the example sentence “The [ calcium ] chemical contents were the highest in the [ papaya ] plant ,” “were the highest in” and “in” are the weak trigger term and the strong trigger term, respectively.
Annotating the corpus
This subsection describes the annotation process for plants, chemicals, and their relationships in all the candidate corpus units as shown in Fig. 1. The annotation task for the corpus was performed by three annotators with a basic knowledge of biology and traditional Chinese medicine. The main annotator performed sentence collection and preprocessing, built the annotation guidelines, and carried out the first annotation of the corpus. Two assistant annotators participated in the corpus annotation.
P.Check: Write the letter “O” when all of the contents in “PlantName,” “P.ID,” and “P.Off” shown in Fig. 2 are correctly annotated and write the letter “X” when any of them are incorrectly annotated.
P.Note: Leave comments about incorrectly annotated contents in “PlantName,” “P.ID,” and “P.Off.” This field is optional.
C.Check: Write the letter “O” when all of the contents in “ChemName,” “C.ID,” and “C.Off” shown in Fig. 2 are correctly annotated and write the letter “X” when any of them are incorrectly annotated.
C.Note: Leave comments about incorrectly annotated contents in “ChemName,” “C.ID,” and “C.Off.” This field is optional.
Label: Write “POS” or “NEG” according to the relationship type between a plant, colored in red, and a chemical, colored in green, in the “Sentence” column. “POS” indicates that a plant includes a chemical while “NEG” means that there is no positive relationship between them.
Weak Trigger: Write trigger terms, which represent the positive relationship between a plant and a chemical, in as broad a range as possible. For example, the weak trigger in the first corpus unit in Fig. 2 can be “were the highest in,” whereas annotators write “NA” when the “Label” column is denoted as “NEG.” This field is required only when the sentence contains a positive relationship between plant and chemical. Leave this field empty when the “Label” is annotated as “NEG.”
Strong Trigger: Unlike the weak trigger, the strong trigger should be the single word regarded by the annotators as the most meaningful word that represents the positive relationship between a plant and a chemical. For example, the strong trigger in the first corpus unit in Fig. 2 can be “in,” whereas annotators write “NA” when the “Label” column is denoted as “NEG.” This field is required only when the sentence contains a positive relationship between plant and chemical. Leave this field empty when the “Label” is annotated as “NEG.”
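The field rules above lend themselves to a simple consistency check on each annotated row. The following is a minimal sketch, not part of the described workflow: the function name and the dictionary representation of a row are assumptions, and because the field descriptions mention both “NA” and an empty value for negative units, the check accepts either.

```python
def validate_row(row):
    """Return a list of problems found in one annotated corpus unit."""
    problems = []
    for key in ("P.Check", "C.Check"):
        if row.get(key) not in ("O", "X"):
            problems.append(f"{key} must be 'O' or 'X'")
    label = row.get("Label")
    if label not in ("POS", "NEG"):
        problems.append("Label must be 'POS' or 'NEG'")
    weak = (row.get("Weak Trigger") or "").strip()
    strong = (row.get("Strong Trigger") or "").strip()
    if label == "POS":
        if weak in ("", "NA"):
            problems.append("Weak Trigger is required for POS")
        if strong in ("", "NA") or len(strong.split()) != 1:
            problems.append("Strong Trigger must be a single word for POS")
    elif label == "NEG":
        for name, value in (("Weak Trigger", weak), ("Strong Trigger", strong)):
            if value not in ("", "NA"):
                problems.append(f"{name} should be empty or 'NA' for NEG")
    return problems

good_row = {"P.Check": "O", "C.Check": "O", "Label": "POS",
            "Weak Trigger": "were the highest in", "Strong Trigger": "in"}
bad_row = {"P.Check": "O", "C.Check": "O", "Label": "NEG",
           "Weak Trigger": "NA", "Strong Trigger": "in"}
print(validate_row(good_row))
print(validate_row(bad_row))
```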
After all annotators completed the annotation task, the main annotator collected annotation results to classify agreements or disagreements.
Constructing a rule development corpus and a rule-based model
We developed a rule-based model to extract plant–chemical relationships from articles. For this task, the main annotator additionally constructed “a rule development corpus,” which is not a duplicate of the primary corpus described in the Annotating the corpus subsection. Then, the performance of the rule-based model was tested against the primary corpus.
For example, the tree in Fig. 3 shows the dependency structure of a passive sentence, “About 450 mg of FB1 was obtained from 800 g cultured corn.” The noun phrase “About 450 mg of FB1” is connected with the “nsubjpass” type, which marks the nominal subject of a passive clause. The noun phrase “800 g cultured corn” is connected with the “pobj” type, which marks the object of a preposition. We can then observe a positive relationship between the noun phrase containing a plant name (“corn”) and the noun phrase containing a chemical name (“FB1”) through the trigger “was obtained from.” Based on such observations, we define a verbal trigger rule for the case in which the trigger is a transitive verb in the passive form, as specified below. For new inputs whose dependency structure is similar to that in Fig. 3, the rule-based model compares the dependency tree of the input with the verbal trigger rule by checking the following: (i) whether there is a noun phrase containing a chemical name that is connected with the “nsubjpass” type; (ii) whether there is a noun phrase containing a plant name that is connected with the “pobj” type, that is, the object of a preposition; and (iii) whether the root term “obtained,” which is linked with “nsubjpass” and “prep,” belongs to the trigger words defined in our trigger dictionary.
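The three checks for the passive verbal trigger rule can be sketched on a simplified, hand-written parse. This is an illustration under stated assumptions rather than the model's implementation: the dictionary layout of `parse` stands in for real dependency parser output, the function name is hypothetical, and the trigger set is only a small subset of the passive-form trigger words.

```python
# Small subset of the passive-form trigger words, for illustration only.
PASSIVE_TRIGGERS = {"obtained", "extracted", "isolated", "derived"}

def match_passive_verbal_rule(parse, plant, chemical):
    """Return the extracted relationship, or None if the rule does not fire.

    `parse` is a simplified stand-in for a dependency tree:
    {"root": str, "nsubjpass": str, "prep": [(preposition, pobj noun phrase)]}
    """
    # Check (iii): the root term must be a known trigger word.
    if parse["root"].lower() not in PASSIVE_TRIGGERS:
        return None
    # Check (i): the passive subject noun phrase must contain the chemical name.
    subject = parse.get("nsubjpass")
    if subject is None or chemical.lower() not in subject.lower():
        return None
    # Check (ii): a prepositional object noun phrase must contain the plant name.
    for preposition, pobj in parse.get("prep", []):
        if plant.lower() in pobj.lower():
            return {"plant": plant, "chemical": chemical,
                    "trigger": f"{parse['root']} {preposition}"}
    return None

parse = {
    "root": "obtained",
    "nsubjpass": "About 450 mg of FB1",
    "prep": [("from", "800 g cultured corn")],
}
result = match_passive_verbal_rule(parse, plant="corn", chemical="FB1")
print(result)
```

Substring matching of entity names inside noun phrases is a simplification; with real parser output, one would compare token spans instead.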
Table 1 The rule specifications
Rule specification type | Rule structure | Trigger word form | Constraint | Example
Verbal trigger rule | NP0 Vtr NP1 | Transitive verb (active form) | - | [Pomegranate derived from the tree Punica granatum] [contains] [anthocyanins]. (PMID: 15493960)
Verbal trigger rule | NP1 Vtr PP NP0 | Transitive verb (passive form) | Any preposition between Vtr and NP0 | [About 450 mg of FB1] [were obtained] [from] [800 g cultured corn]. (PMID: 23605447)
Verbal trigger rule | NP0 Vtr PP NP1 | Intransitive verb | Any preposition between Vtr and NP1 | [The volatile oil (2-3 %) of ginger] [consists] [of] mainly [mono and sesquiterpenes]. (PMID: 17637489)
Preposition trigger rule | NP0 PPtr NP1 | - | - | [switchgrass] [as] [a sole carbon (C) source]. (PMID: 22354956)
Preposition trigger rule | NP1 PPtr NP0 | - | - | [Saponins] [from] [the flowers of Panax notoginseng]. (PMID: 20518315)
Relative trigger rule | NP1 Rtr PP NP0 | Past participle form | Any preposition between Rtr and NP0 | [Anthocyanins] [isolated] [from] [black soybean seed coat]. (PMID: 16457818)
Relative trigger rule | NP0 Rtr (PP) NP1 | Gerund form | When the trigger word (Rtr) is “consisting,” the preposition (PP) “of” should follow Rtr. | With [thermally degraded Feverfew powder] [containing] [less contents of parthenolide] no built-up antiserotonergic responses were observed after one month. (PMID: 11603284)
Apposition trigger rule | NP0 APtr NP1 | Apposition form (e.g. comma) | The token distance between NP0 and NP1 should be within ten. | Whereas that in PD is [soybean oil][,] [a source of unsaturated fatty acids]. (PMID: 19932903)
Apposition trigger rule | NP1 APtr NP0 | Apposition form (e.g. comma) | The token distance between NP0 and NP1 should be within ten. | [Delta9-tetrahydrocannabinol (THC)][,] [the major active component of marijuana]. (PMID: 9129126)
Copula trigger rule | NP0 Ctr NP1 | Be verb form | The token distance between NP0 and NP1 should be within ten. | [Haematococcus pluvialis] [is] [one of the potent organisms for production of astaxanthin]. (PMID: 23605447)
Copula trigger rule | NP1 Ctr NP0 | Be verb form | The token distance between NP0 and NP1 should be within ten. | [The calcium contents] [were] the highest in [the papaya]. (PMID: 21695915)
Compound noun trigger rule | NP0 CNtr NP1 | - | - | To study the protective effect of [panax notoginseng] [saponins (PNS)]. (PMID: 19317166)
A verbal trigger rule: As shown in Table 1, the specification for the verbal trigger has three types according to the trigger type. The first rule structure is defined as NP0 Vtr NP1, where the trigger type of Vtr is a transitive active form such as “contain” and “include.” The second rule structure consists of NP1 Vtr PP NP0. In this case, the trigger type is a transitive passive form such as “be contained” and “be extracted.” Thus, as a constraint, a preposition denoted as “PP” is required between Vtr and NP0. In the third rule structure, Vtr should be an intransitive verb such as “consist.” Therefore, a preposition such as “of” should be placed next to Vtr in the rule structure NP0 Vtr PP NP1.
A prepositional trigger rule: The rule structure has the following two cases: NP0 PPtr NP1 and NP1 PPtr NP0. Both NP0 and NP1 are allowed to be located on the left and right side of the trigger. Any preposition can be placed in PPtr.
A relative trigger rule: This consists of two types of specifications according to the trigger type. The first rule structure is defined as NP1 Rtr PP NP0, where Rtr is the past participle form such as “isolated” and “extracted,” and any preposition must be located between Rtr and NP0. The second rule structure is described as NP0 Rtr (PP) NP1, where the trigger type Rtr is a gerund form such as “containing” and “having.” Note that, in this case, a preposition such as “of” should be located between Rtr and NP1 when the trigger word is the gerund form of an intransitive verb such as “consisting.”
An apposition trigger rule: The rule structure has the following two cases: NP0 APtr NP1 and NP1 APtr NP0. In this specification, triggers can be any token whose dependency type is apposition. For example, a comma between NP0 and NP1 can indicate that NP0 is in apposition with NP1 or vice versa. Furthermore, we simply set a constraint that the token distance between NP0 and NP1 is less than ten to avoid cases that do not represent a relationship between them.
A copula trigger rule: The rule structure has the following two cases: NP0 Ctr NP1 and NP1 Ctr NP0. In this specification, a trigger “Ctr” can be any verb regardless of tense if its dependency type is copula. To reduce false positives, we also set a simple constraint that the token distance between NP0 and NP1 must be less than ten.
A compound noun trigger rule: The rule structure is NP0 CNtr NP1, where the trigger CNtr is a single white space between NP0 and NP1. For example, “Aloe emodin” indicates that emodin is one of the ingredients in aloe.
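The token-distance constraint of the apposition and copula rules, and the adjacency condition of the compound noun rule, can be sketched with token spans. The text does not define how token distance is counted, so this sketch assumes it is the number of tokens lying between the two noun phrases; the span convention (start inclusive, end exclusive), the tokenization, and the function names are likewise assumptions.

```python
def token_distance(span_a, span_b):
    """Count tokens between two non-overlapping (start, end) spans, end exclusive."""
    first, second = sorted([span_a, span_b])
    return max(0, second[0] - first[1])

def within_limit(span_a, span_b, limit=10):
    """Constraint for apposition and copula rules: distance less than `limit`."""
    return token_distance(span_a, span_b) < limit

def is_compound_noun(np0_span, np1_span):
    """Compound noun rule: the chemical phrase directly follows the plant phrase."""
    return np0_span[1] == np1_span[0]

# Apposition example from the rule table, tokenized by hand for this sketch.
tokens = ["Delta9-tetrahydrocannabinol", "(", "THC", ")", ",",
          "the", "major", "active", "component", "of", "marijuana"]
np1_span = (0, 4)   # chemical phrase: Delta9-tetrahydrocannabinol ( THC )
np0_span = (5, 11)  # plant phrase: the major active component of marijuana
appos_distance = token_distance(np0_span, np1_span)

# Compound noun example: "panax notoginseng" followed by "saponins".
compound_tokens = ["panax", "notoginseng", "saponins"]
compound = is_compound_noun((0, 2), (2, 3))
print(appos_distance, within_limit(np0_span, np1_span), compound)
```

In the apposition example only the comma separates the two phrases, so the constraint is satisfied.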
Trigger words used in the rule-based model. The table shows trigger words selected for the six predefined rules in the model
Trigger type | Trigger word form | Trigger words
Verbal trigger (Vtr) | Active form | contain, contains, contained; have, has, had; involve, involves, involved; incorporate, incorporates, incorporated; possess, possesses, possessed; encompass, encompasses, encompassed; subsume, subsumes, subsumed; comprise, comprises, comprised; embody, embodies, embodied; embrace, embraces, embraced; include, includes, included; cover, covers, covered; compose, composes, composed; originate, originates, originated; produce, produces, produced; derive, derives, derived; accumulate, accumulates, accumulated; release, releases, released
Verbal trigger (Vtr) | Passive form | contained, involved, incorporated, possessed, encompassed, subsumed, comprised, embodied, embraced, included, covered, composed, produced, originated, derived, accumulated, released, isolated, extracted, separated, detached, split, segregated, obtained, found, gained, discovered, uncovered, identified
Prepositional trigger (PPtr) | - | any token whose dependency type is “prep”
Relative trigger (Rtr) | Past participle form | see trigger words in the passive form section above (same as passive form)
Relative trigger (Rtr) | Gerund form | containing, involving, incorporating, possessing, encompassing, subsuming, comprising, embracing, including, covering, composing, embodying, producing, originating, deriving, accumulating, releasing, having, consisting
Apposition trigger (APtr) | - | any token whose dependency type is “appos”
Copula trigger (Ctr) | - | any token whose dependency type is “cop”
Compound noun trigger (CNtr) | Compound noun form | strings that are made up of plant and chemical names together (e.g. panax ginseng saponin)
Overall process for extracting plant–chemical relationships
Inter-annotator agreement (IAA)
To assess the accuracy of the corpus, we calculated the IAA scores using simple percent agreement and Cohen’s kappa statistic. The simple percent agreement is calculated as follows: [agreements / (agreements + disagreements)] × 100 %. Cohen’s kappa is the most frequently used method for measuring the overall agreement between two annotators, and it is generally regarded as a more robust measurement than the simple agreement calculation because it corrects for agreement expected by chance. According to a commonly used interpretation scale, kappa values within the range 61 to 80 % are considered as “substantial” agreement, and values between 81 and 99 % constitute “almost perfect” agreement.
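Both measures can be computed directly from two annotators' label sequences. The sketch below uses the standard two-annotator form of Cohen's kappa, (observed agreement minus chance agreement) divided by (1 minus chance agreement); how pairwise scores were combined across the three annotators is not specified here, and the ten labels are invented for illustration.

```python
from collections import Counter

def percent_agreement(labels_a, labels_b):
    """Simple percent agreement: agreements / (agreements + disagreements) x 100."""
    agreements = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * agreements / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Invented labels for ten corpus units (not taken from the corpus).
annotator_1 = ["POS", "POS", "NEG", "NEG", "POS", "NEG", "POS", "NEG", "POS", "NEG"]
annotator_2 = ["POS", "POS", "NEG", "POS", "POS", "NEG", "POS", "NEG", "NEG", "NEG"]

simple = percent_agreement(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"simple agreement: {simple:.1f} %, kappa: {kappa:.2f}")
```

With eight agreements out of ten and balanced labels, chance agreement is 0.5, so the simple agreement of 80 % corresponds to a kappa of 0.6, which shows why kappa is the stricter of the two measures.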
Statistics and IAA scores for entities, relation labels, and triggers. The IAA scores for entities and triggers were calculated using simple percent agreement. The IAA score for relation labels was calculated using Cohen’s kappa statistic
# of entities | # of agreements for entities (IAA, Simple) | # of class labels | # of agreements for relation labels (IAA, Kappa) | # of triggers | # of agreements for triggers (IAA, Simple)
 | 640 (99.7 %) | | 570 (77.6 %) | | 275 (96.8 %)
 | 636 (99.1 %) | | | | 271 (95.4 %)
 | 1,276 (99.4 %) | | | | 546 (96.1 %)
 | 401 (100.0 %) | | 369 (82.8 %) | | 234 (95.5 %)
 | 400 (99.8 %) | | | | 223 (91.0 %)
 | 801 (99.9 %) | | | | 457 (93.3 %)
 | 2,077 (99.6 %) | | 939 (79.8 %) | | 1,003 (94.8 %)
Statistics and IAA scores of annotated corpus units. An agreement was counted when annotators agreed on both entities and relation labels. The IAA score in this case was calculated using Cohen’s kappa statistic
# of corpus units
# of agreements for both entities and relations
IAA score, Kappa
Example 1. [Allyl isothiocyanate] chemical (AITC) is [a constituent of] trigger several plants of the family [Cruciferae] plant that are commonly used as food.
For the corpus unit in Example 1, one annotator assigned a negative relationship due to misinterpreting the meaning of the sentence. Actually, the sentence includes a positive relationship between the plant “Cruciferae” and the chemical “Allyl isothiocyanate” because they are closely linked with the trigger “a constituent of.”
Example 2. 1,3-dichloropropene (1,3-D) was evaluated as a potential alternative for the widely used soil fumigant [methyl bromide] chemical (MeBr) [in] trigger [cucumber (Cucumis sativus Linn.)] plant crops in China.
For the corpus unit in Example 2, one annotator interpreted the chemical “methyl bromide” as one of the ingredients originating from the plant “cucumber.” However, the chemical “methyl bromide” is a powerful pesticide for cultivating cucumber crops, not an original component of the plant. As such, misunderstanding of the origin of ingredients can induce disagreement cases.
Example 3. The authors studied the changes in subjective symptoms of menopause in 2016 Hungarian women who had been treated with an [isopropanol] chemical extract [of] trigger [Cimicifuga racemosa] plant (black cohosh).
For the corpus unit in Example 3, one annotator annotated it as a positive relationship between the plant “Cimicifuga racemosa” and the chemical “isopropanol.” In fact, as described in the guidelines, the chemical “isopropanol” is used in a common extraction method for deriving active ingredients from plants. In this case, annotations between annotators were different due to disregarding the guidelines.
Evaluating the rule-based model
We constructed the rule development corpus consisting of 102 corpus units (50 positives and 52 negatives) from 50 sentences, and this corpus is also provided at the corpus website. The rule-based model was developed based on the rule development corpus. We measured the performance of the rule-based model using 939 corpus units in the primary corpus. The 939 corpus units that are input to the rule-based model already contain annotation information for entity names, triggers, and their locations in the texts. The model then performs the following steps for a given input corpus: (i) apply the Stanford Dependency Parser to the corpus units to obtain dependency parse trees; (ii) check whether any dependency parse tree structurally matches one of the rules defined in the model; (iii) seek trigger words in the sentence only if the dependency parse tree of the corpus unit matched one of the rules in the previous step; and (iv) if a trigger word is detected, the system recognizes a positive relationship between the plant and the chemical in that corpus unit. Finally, we obtained the predicted class labels (positive or negative) of the 939 corpus units and compared them with the originally annotated class labels.
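Steps (ii) to (iv) can be sketched in Python as follows. This is a simplified illustration and not the rule set of the model: the dependency edges are written by hand instead of being produced by a parser, the single structural "rule" (the trigger must lie on the dependency path between the plant and the chemical) is a hypothetical stand-in for the actual rules, and only a few passive-form trigger words are included. The names `shortest_path` and `classify` are likewise assumptions.

```python
from collections import deque

# A few passive-form trigger words; the full lexicon is larger.
PASSIVE_TRIGGERS = {"contained", "isolated", "extracted", "derived", "obtained", "found"}


def shortest_path(edges, start, goal):
    """Breadth-first search over an undirected view of the dependency edges.

    Tokens are plain strings here, so a sentence with repeated tokens would
    need token indices instead (simplification for this sketch).
    """
    graph = {}
    for head, _dep_type, child in edges:
        graph.setdefault(head, set()).add(child)
        graph.setdefault(child, set()).add(head)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def classify(edges, plant, chemical, triggers=PASSIVE_TRIGGERS):
    """Return 'positive' or 'negative' for one corpus unit."""
    # Step (ii): hypothetical structural rule, the entities must be connected.
    path = shortest_path(edges, plant, chemical)
    if path is None:
        return "negative"
    # Steps (iii) and (iv): look for a trigger word between the two entities.
    between = path[1:-1]
    return "positive" if any(tok.lower() in triggers for tok in between) else "negative"


# "Curcumin was isolated from turmeric." (hand-written edges: head, type, dependent)
edges_pos = [
    ("isolated", "nsubjpass", "Curcumin"),
    ("isolated", "auxpass", "was"),
    ("isolated", "prep", "from"),
    ("from", "pobj", "turmeric"),
]
# "Bromide is used in cucumber crops." (a pesticide, not a plant constituent)
edges_neg = [
    ("used", "nsubjpass", "bromide"),
    ("used", "auxpass", "is"),
    ("used", "prep", "in"),
    ("in", "pobj", "crops"),
    ("crops", "nn", "cucumber"),
]

print(classify(edges_pos, "turmeric", "Curcumin"))
print(classify(edges_neg, "cucumber", "bromide"))
```

Requiring both a structural match and a trigger word keeps a sentence such as the second one from being labeled positive merely because the two names co-occur.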
Evaluation of a rule-based text mining model to extract plant–chemical relationships using the corpus data
We performed an additional evaluation of the rule-based model by applying it to 43 randomly selected PubMed abstracts. The abstracts contain a total of 59 co-occurrence sentences containing both plant and chemical names. Two annotators manually annotated 113 corpus units from 59 co-occurrence sentences. When we applied the rule-based model to the 113 newly annotated corpus units, it achieved an F-score of 61.81 % (precision = 62.96 %, recall = 60.71 %). This new corpus contains 28 positive and 85 negative relationships. Because the new corpus was constructed at the abstract level, the ratio of positive and negative sentences is different from the primary corpus in our work, where the numbers of positive and negative relationships are similar. Although it contains more negative relationships, the performance of the rule-based model is similar to that of phase 1 in the primary corpus, showing the rule-based model can be used for a corpus at the abstract level as well as a corpus at the sentence level.
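The relationship between the reported precision, recall, and F-score can be checked with a short Python sketch. The counts used below (17 true positives, 10 false positives, 11 false negatives) are not reported in this paper; they are an assumption back-calculated from the reported percentages and the 28 positive relationships, and they reproduce the reported values up to rounding in the last digit.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F-score) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed counts, inferred from the reported percentages (not from the paper).
tp, fp, fn = 17, 10, 11  # tp + fn = 28 positive relationships
p, r, f = precision_recall_f1(tp, fp, fn)
print(f"precision={p:.2%} recall={r:.2%} F={f:.2%}")
```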
In this study, we constructed a plant–chemical corpus that can facilitate the development of a relationship extraction system for collecting plant–chemical relationships from texts. Thus, we have introduced guidelines for annotating sentences that represent relationships between plants and chemicals and also described how we constructed the corpus. As a result, we have identified a total of 1,007 plant–chemical corpus units from 377 gold-standard sentences that were selected from 245 PubMed abstracts.
In recent NLP studies, machine learning approaches are predominantly employed and frequently perform better than rule-based approaches for many NLP tasks. However, the rule-based approach has some advantages; for instance, it is easy to incorporate domain knowledge and to fix the cause of errors, although it requires extensive manual labor. In addition, in the process of constructing general rules by analyzing sentences, domain knowledge can be accumulated, whereas machine learning approaches usually work as a black box. In our work, with a relatively small number of 102 corpus units in the rule development corpus, we developed the rule-based model for extracting plant–chemical relationships. It achieved an F-measure of 68 % on 939 corpus units. For comparison, we employed the Turku Event Extraction System (TEES), which is a support vector machine based text mining system. Although TEES was originally developed for extracting relationships between genes and biological events, it can be modified for extracting any binary relationship. We trained TEES with the same 102 corpus units included in the rule development corpus. Because TEES requires a development set, the 68 corpus units that were agreed upon among annotators after resolving the disagreements were used as the development set. We applied the model to the 939 corpus units in the primary corpus. This resulted in an F-measure of 25.7 % on the corpus units, which is poor compared to the rule-based model. This is because the training set was small. Hence, we again trained and evaluated the TEES model using ten-fold cross validation on a larger set of 1,109 corpus units, including the following: (i) 939 corpus units from the primary corpus; (ii) 68 corpus units that were agreed upon among annotators after harmonizing the disagreements; and (iii) 102 rule development corpus units.
Note that 798, 200, and 111 corpus units were used as training, development, and test sets, respectively, for the performance measurement in the ten-fold cross validation. As a result, it achieved an F-measure of 77.7 %. In this respect, it is possible to apply our corpus data to various machine learning methods to achieve better performance.
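The split sizes quoted above follow from the arithmetic of ten-fold cross validation on 1,109 units, as the following Python sketch shows. How the development set was actually drawn is not described here, so taking the first 200 remaining units after shuffling is an assumption, as are the function name `ten_fold_splits` and the fixed random seed.

```python
import random


def ten_fold_splits(n_units, n_folds=10, dev_size=200, seed=0):
    """Yield (train, dev, test) index lists for each cross-validation fold."""
    indices = list(range(n_units))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        rest = [i for j, fold in enumerate(folds) if j != k for i in fold]
        # Assumption: the development set is carved out of the non-test units.
        dev, train = rest[:dev_size], rest[dev_size:]
        yield train, dev, test


splits = list(ten_fold_splits(1109))
train, dev, test = splits[0]
print(len(train), len(dev), len(test))
```

Because 1,109 is not divisible by ten, one of the ten test folds holds 110 units instead of 111, and the corresponding training set holds 799.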
Our future work is to develop a text mining model based on the corpus developed in this work and to apply it to the prediction of plant–chemical relationships in all the abstracts in PubMed. This is challenging work because we need to improve NER models and the relationship prediction model by combining the rule-based model and a machine learning approach. It also requires validating the final accuracy of plant–chemical relationships predicted from all abstracts. If this future work is successful, we expect to obtain a large number of plant–chemical relationships. With the increasing importance of natural products, knowledge about active compounds in herbs has become significant. Numerous diseases can be treated by herb compounds, and various medicines, cosmetics, and other products have widely used herbal compounds as their main components. Thus, we believe that our research provides an important step related to herbs or natural products in text-mining studies.
This work was supported by the Bio-Synergy Research Project (NRF-2013M3A9C4078138) of the Ministry of Science and the GIST Research Institute (GRI) in 2016.
All the authors shared the responsibility for annotating the corpus in this paper. WC is the main annotator and developed the rule-based model. HL and DL initiated the study. All the authors participated in writing the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Bennett BC, Prance GT. Introduced plants in the indigenous pharmacopoeia of northern south america. Econ Bot. 2000; 54(1):90–102.
- Calixto JB. Twenty-five years of research on medicinal plants in latin america: a personal view. J Ethnopharmacol. 2005; 100(1):131–4.
- O’Hara M, Kiefer D, Farrell K, Kemper K. A review of 12 commonly used medicinal herbs. Arch Fam Med. 1998; 7(6):523.
- Esmat AY, Said MM, Soliman AA, El-Masry KS, Badiea EA. Bioactive compounds, antioxidant potential, and hepatoprotective activity of sea cucumber (holothuria atra) against thioacetamide intoxication in rats. Nutrition. 2013; 29(1):258–67.
- Han JH, Koh W, Lee HJ, Lee HJ, Lee EO, Lee SJ, Khil JH, Kim JT, Jeong SJ, Kim SH. Analgesic and anti-inflammatory effects of ethyl acetate fraction of polygonum cuspidatum in experimental animals. Immunopharmacol Immunotoxicol. 2012; 34(2):191–5.
- Bjorne J, Ginter F, Pyysalo S, Tsujii J, Salakoski T. Ganghwaljetongyeum, an anti-arthritic remedy, attenuates synoviocyte proliferation and reduces the production of proinflammatory mediators in macrophages: the therapeutic effect of ghjty on rheumatoid arthritis. BMC Complement Altern Med. 2013; 13(1):1.
- Koehn FE, Carter GT. The evolving role of natural products in drug discovery. Nat Rev Drug Discov. 2005; 4(3):206–20.
- Zhao J, Jiang P, Zhang W. Molecular networks for the study of tcm pharmacology. Brief Bioinform. 2010; 11(4):417–30.
- Wang L, Zhou GB, Liu P, Song JH, Liang Y, Yan XJ, Xu F, Wang BS, Mao JH, Shen ZX, Chen SJ, Chen Z. Dissection of mechanisms of chinese medicinal formula realgar-indigo naturalis as an effective treatment for promyelocytic leukemia. Proc Natl Acad Sci. 2008; 105(12):4826–831.
- Chen CYC. Tcm database@taiwan: the world’s largest traditional chinese medicine database for drug screening in silico. PloS ONE. 2011; 6(1):15939.
- Xue R, Fang Z, Zhang M, Yi Z, Wen C, Shi T. Tcmid: traditional chinese medicine integrative database for herb molecular mechanism analysis. Nucleic Acids Res. 2013; 41(D1):D1089-95.
- Chen X, Zhou H, Liu YB, Wang JF, Li H, Ung CY, Han LY, Cao ZW, Chen YZ. Database of traditional chinese medicine and its application to studies of mechanism and to prescription validation. Br J Pharmacol. 2006; 149(8):1092–1103.
- Ye H, Ye L, Kang H, Zhang D, Tao L, Tang K, Liu X, Zhu R, Liu Q, Chen YZ, Li Y, Cao Z. Hit: linking herbal active ingredients to targets. Nucleic Acids Res. 2011; 39(suppl 1):1055–1059.
- Kuhn M, von Mering C, Campillos M, Jensen L, Bork P. Stitch: interaction networks of chemicals and proteins. Nucleic Acids Res. 2008; 36(suppl 1):684–8.
- Hamosh A, Scott AF, Amberger J, Valle D, McKusick VA. Online mendelian inheritance in man (omim). Hum Mutat. 2000; 15(1):57–61.
- Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J. Drugbank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 2006; 34:668–72.
- Federhen S. The NCBI Taxonomy database. Nucleic Acids Res. 2012; 40(Database issue):D136–43.
- Jensen K, Panagiotou G, Kouskoumvekaki I. Integrated text mining and chemoinformatics analysis associates diet to health benefit at molecular level. PLoS Comput Biol. 2014; 10(1):1003432.
- Marcus M, Santorini S, Marcinkiewicz M. Building a large annotated corpus of english: the penn treebank. Comput Linguist. 1993; 19(2):313–30.
- Comeau DC, Dogan RI, Ciccarese P, Cohen KB, Krallinger M, Leitner F, Lu Z, Peng Y, Rinaldi F, Torii M, Valencia A, Verspoor K, Wiegers TC, Wu CH, Wilbur WJ. Bioc: a minimalist approach to interoperability for biomedical text processing. Database. 2013; 2013:bat064.
- Gerner M, Nenadic G, Bergman CM. Linnaeus: a species name identification system for biomedical literature. BMC Bioinforma. 2010; 11(1):1.
- Pafilis E, Frankild SP, Fanini L, Faulwetter S, Pavloudi C, Vasileiadou A, Arvanitidis C, Jensen LJ. The species and organisms resources for fast and accurate identification of taxonomic names in text. PLoS ONE. 2013; 8(6):65390.
- Krallinger M, Rabal O, Leitner F, Vazquez M, Salgado D, Lu Z, Leaman R, Lu Y, Ji D, Lowe DM, Sayle RA, Batista-Navarro RT, Rak R, Huber T, Rocktaschel T, Matos S, Campos D, Tang B, Xu H, Munkhdalai T, Ryu KH, Ramanan SV, Nathan S, Zitnik S, Bajec M, Weber L, Irmer M, Akhondi SA, Kors JA, Xu S, An X, Sikdar UK, Ekbal A, Yoshioka M, Dieb TM, Choi M, Verspoor K, Khabsa M, Giles CL, Liu H, Ravikumar KE, Lamurias A, Couto FM, Dai H, Tsai RT, Ata C, Can T, Usie A, Alves R, Segura-Bedmar I, Martinez P, Oryzabal J, Valencia A. The chemdner corpus of chemicals and drugs and its annotation principles. J Cheminformatics. 2015; 7(1):1.
- Li J, Sun Y, Johnson R, Sciaky D, Wei C, Leaman R, Davis AP, Mattingly C, Wiegers T, Lu Z. Annotating chemicals, diseases and their interactions in biomedical literature. In: Proceedings of the fifth BioCreative challenge evaluation workshop, BioCreative Organizing Committee. Sevilla: 2015. p. 173–182.
- Wei CH, Kao HY, Lu Z. Pubtator: a web-based text mining tool for assisting biocuration. Nucleic Acids Res. 2013; 41(Web Server issue):W518–22.
- Leaman R, Dogan RI, Lu Z. Dnorm: disease name normalization with pairwise learning to rank. Bioinformatics. 2013; 29(22):2909–17.
- Leaman R, Wei CH, Lu Z. tmchem: a high performance approach for chemical named entity recognition and normalization. J Cheminformatics. 2015; 7(1):1.
- Wiegers T, Davis A, Cohen KB, Hirschman L, Mattingly C. Text mining and manual curation of chemical-gene-disease networks for the comparative toxicogenomics database (ctd). BMC Bioinforma. 2009; 10(1):1.
- Garten Y, Altman RB. Pharmspresso: a text mining tool for extraction of pharmacogenomic concepts and relationships from full text. BMC Bioinforma. 2009; 10(2):1.
- Baldwin B, Carpenter B. LingPipe. http://www.alias-i.com/lingpipe. Accessed 19 Jan 2015.
- Rocktaschel T, Weidlich M, Leser U. Chemspot: a hybrid system for chemical named entity recognition. Bioinformatics. 2012; 28:1633–1640.
- Viera AJ, Garrett JM. Understanding interobserver agreement: the kappa statistic. Fam Med J. 2005; 37:360–3.
- Cohen KB, Verspoor K, Johnson HL, Roeder C, Ogren PV, Jr WAB, White E, Tipney H, Hunter L. High-precision biological event extraction: Effects of system and of data. Comput Intell. 2011; 27(4):681–701.
- Chiticariu L, Li Y, Frederick FR. Rule-based information extraction is dead! long live rule-based information extraction systems!. In EMNLP. 2013; October:827–32.
- Björne J, Salakoski T. Tees 2.1: Automated annotation scheme learning in the bionlp 2013 shared task. In: Proceedings of the BioNLP Shared Task 2013 Workshop, Association for Computational Linguistics (ACL); Sofia: 2013. p. 16–25.