Semantic annotation of morphological descriptions: an overall strategy
© Cui; licensee BioMed Central Ltd. 2010
Received: 17 December 2009
Accepted: 25 May 2010
Published: 25 May 2010
Large volumes of morphological descriptions of whole organisms have been created as print or electronic text in a human-readable format. Converting the descriptions into computer-readable formats gives new life to this valuable knowledge on biodiversity. Research in this area started 20 years ago, yet progress has been insufficient to produce an automated system that requires only minimal human intervention and works on descriptions of various plant and animal groups. This paper examines the hindering factors by identifying the mismatches between existing research and the characteristics of morphological descriptions.
This paper reviews the techniques that have been used for automated annotation, reports exploratory results on characteristics of morphological descriptions as a genre, and identifies challenges facing automated annotation systems. Based on these findings, the paper proposes an overall strategy for converting descriptions of various taxon groups with the least human effort.
A combined unsupervised and supervised machine learning strategy is needed to construct domain ontologies and lexicons and to ultimately achieve automated semantic annotation of morphological descriptions. Further, we suggest that each effort in creating a new description or annotating an individual description collection should be shared and contribute to the "biodiversity information commons" for the Semantic Web. This cannot be done without a sound strategy and a close partnership between and among information scientists and biologists.
Semantic annotation of morphological descriptions may be at "clause" or "character" level. A clause is a segment of text terminated by a semicolon (;) or period (.). In morphological descriptions, a clause may not be a grammatical sentence (see examples in Figure 1). Clause-level annotation labels individual clauses with a meaningful tag, while character-level annotation identifies character/state pairs for describing organs. In addition, a method distinguishing description paragraphs for nomenclature, distribution, and other types of sections is also needed. Figure 1 illustrates the three levels of annotation in XML for a plant description from the Flora of North America.
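The clause definition above can be made concrete with a minimal Python sketch. It splits a description at semicolons and periods and wraps each clause in an XML element. The tag names and the tagging rule (`first_word_tag`, which simply uses the first word of the clause) are hypothetical placeholders for illustration and are not the markup shown in Figure 1. Splitting only where the punctuation is followed by whitespace keeps decimals such as "2.5" intact, but abbreviations such as "ca." would still be split incorrectly.

```python
import re
import xml.etree.ElementTree as ET


def split_clauses(description):
    """Split a description into clauses terminated by ';' or '.'."""
    # Split only where ';' or '.' is followed by whitespace, so "2.5" survives.
    parts = re.split(r"(?<=[;.])\s+", description.strip())
    return [p for p in parts if p]


def first_word_tag(clause):
    """Hypothetical tagging rule: use the clause's first word as its tag."""
    return re.sub(r"[^a-z]", "", clause.split()[0].lower()) or "unknown"


def annotate_clauses(description, tag_for_clause=first_word_tag):
    """Wrap each clause in an XML element named by tag_for_clause."""
    root = ET.Element("description")
    for clause in split_clauses(description):
        element = ET.SubElement(root, tag_for_clause(clause))
        element.text = clause
    return root


if __name__ == "__main__":
    tree = annotate_clauses("Stems erect; leaves 2.5 cm long.")
    print(ET.tostring(tree, encoding="unicode"))
```

A real system would replace `first_word_tag` with learned organ names, which is the role the annotation techniques reviewed in this paper play.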
The inserted tags bring a computer's "understanding" of morphological descriptions to a higher level that would support more intelligent uses of the information than keyword-based search. Besides improving the accuracy of information retrieval, the tags make it possible for a computer to quickly merge or compare different descriptions organ by organ and character by character. This new capability will impact comparative biological research, the methods used in generating identification keys, and even the way an editor reviews manuscripts [2, 3].
Relevant research in annotating biosystematic literature will be reviewed next.
Methods used for semantic annotation of taxonomic documents
A syntactic parsing technique was used by a number of earlier projects. Taylor and Abascal & Sanchenz hand-crafted a set of simple grammar rules and a small lexicon specifically for extracting character states from several Floras [4, 5]. The performance of Taylor's system was not formally evaluated, but its recall was estimated at 60% to 80%.
The major advantage of the syntactic parsing approach lies in the ease of constructing a parser once the lexicon and grammar rules are prepared. The main drawback is precisely its reliance on the lexicon and grammar rules. Because the terminology is diverse and the syntax deviates from that of natural language (see Characteristics of morphological descriptions in the Results section), preparing lexicons and grammar rules for each individual collection or taxon group would be prohibitively expensive.
Rules called "regular expression patterns" that rely on the regularity in the style and the use of punctuation marks were hand-crafted and found to be useful for extracting nomenclature and distribution information [6–8]. However, this approach is not useful for morphological descriptions, because of (1) the lack of such regularity in morphological descriptions and (2) the low reusability of such rules on a different description collection. Lydon et al. showed that the narratives on five common species were so different among six English Floras that only 9% of information was expressed in the same way.
A supervised learning technique was also used by Cui, in which the algorithm performed clause-level annotation by learning "association rules" from training examples; these rules are less sensitive to text variations than the extraction patterns discussed above. Its annotation accuracy ranged from the upper 80s to the upper 90s (in percent) on three different Floras (Flora of North America, Flora of China, and Flora of North Central Texas) comprising over ten thousand descriptions [1, 12–15].
Machine learning has the advantage over manual work in its ability to programmatically evaluate its learning and adjust candidate patterns/rules based on what is seen in the training examples. However, the need for training examples is also a shortcoming, as training examples must be prepared for different taxon groups and even different collections. More importantly, if certain organs/characters are not included in the predefined extraction targets, they will be quietly ignored, resulting in loss of information. This "inadequate template" problem was also noted by Wood et al., where manually created dictionaries, an ontology, and a lookup list were used to extract and correlate characters/states from a set of 18 plant species descriptions. They had to tag organs that were not in their lists as "UnknownPlantPart." By using parallel text, Wood et al. found three times more of the targeted information, which would otherwise have been missed, improving extraction recall threefold. Diederich, Fortuner & Milton reported a system called Terminator, which is very similar to Wood et al.'s in that both use a hand-crafted domain ontology to support character extraction.
Review of the existing annotation techniques.

Technique: Syntactic parsing: 1. Abascal & Sanchenz (1999); 2. Taylor (1995)
Handmade prerequisites and their reusability: Lexicon & grammar rules: Not good for another taxon group/collection.
Results and their reusability: 1. Style clues: Less reusable. 2. Organ names & character states: Reusable.
Scope of evaluation: 1. FNA v. 19. 2. Flora of New South Wales, Flora of Australia.
Performance: 1. Not reported. 2. Roughly estimated recall: 60%-80%.

Technique: Supervised machine learning--text classification: Cui & al. (2002)
Handmade prerequisites and their reusability: Training examples: Not good for another taxon group.
Results and their reusability: Classification models: Less reusable.
Scope of evaluation: 1500+ descriptions from FNA.
Performance: Recall: 94%. Precision: 97%.

Technique: Ontology based extraction: 1. Diederich, Fortuner & Milton (1999); 2. Wood & al. (2003)
Handmade prerequisites and their reusability: ontology, & checklists: Not good for another taxon group.
Results and their reusability: Organ names & character states.
Scope of evaluation: 1. 16 descriptions. 2. 18 species descriptions from six Floras.
Performance: 1. Accuracy on 1 sample: 76%. 2. Recall: 66%.

Technique: Supervised machine learning--extraction patterns: Tang & Heidorn (2007)
Handmade prerequisites and their reusability: Extraction template & training examples: Not good for another taxon group. Character, limit to these character states: leaf shape, size, color; Fruit type.
Results and their reusability: Extraction patterns: Sensitive to text variations, less reusable. Character states: Reusable.
Scope of evaluation: 1600 FNA species.

Technique: Supervised machine learning--association rules: Cui (2008a)
Handmade prerequisites and their reusability: Annotation template & training examples: Not good for another taxon group.
Results and their reusability: Association rules: Reusable only within the same taxon group.
Scope of evaluation: 16,000 descriptions from FNA, FOC, and FNCT.
Performance: Recall and precision: 80%-95%.

Technique: Unsupervised learning: Cui (2008b)
Results and their reusability: Organ names & character states.
Scope of evaluation: FNA, FOC, & Treatises Part H.

The techniques reviewed here all have strengths as well as weaknesses. Yet, when facing the millions of OCRed text descriptions produced by the Biodiversity Heritage Library, none of them seems to be both effective and efficient. The reason, we suggest, lies in the special characteristics of the morphological descriptions.
In this section we present our exploratory results on the characteristics of morphological descriptions and on an unsupervised machine learning strategy. Insights gained via these exercises give rise to an overall strategy for semantic annotation of morphological descriptions, which we shall discuss at the end of this section.
Characteristics of morphological descriptions
The performance of a semantic annotation technique depends, at least in part, on the characteristics of the documents to be annotated. A technique that identifies organ names by looking for bold words, for example, is not very useful for the task overall, because many descriptions are not styled that way. Here, in our search for a sound overall strategy to mark up all morphological descriptions in English, we consider some general characteristics of morphological descriptions which are challenging or beneficial for an automated semantic annotation technique.
1. Challenging characteristics
Diverse terminology: each biodiversity branch has a more or less distinct set of terminology. Not only are terms used in brachiopod (Animalia) descriptions different from those in plant descriptions, but terms in one plant family description are somewhat different from those in another. Several previous researchers (e.g. Wood et al., and Cui & Heidorn) have reported that when applying a system crafted from one set of documents to a different set, new concepts that were unknown to the system were encountered, forcing an automated system to work in an interactive and iterative fashion to incorporate new concepts along the way [16, 19].
Diverse meanings: While it is well known that the same word can have different meanings in different domains, the exact meaning of a term within one taxon group is not always well defined either. For example, the term "erect" takes on a number of different meanings depending on which botanical thesaurus one consults: the FNA Glossary defines "erect" as a state of orientation, the Oxford Virtual Field Herbarium Plant Characteristics defines it as a state of habit, and two different versions of the PATO ontology label the concept placement and position, respectively [21–23]. Cui conducted a comparison of four machine-readable glossaries in botany (including the above-mentioned three) and found that among 1964 character states extracted from five volumes of FNA and four volumes of FOC, 64 were included by all four glossaries, and only 12 of the 64 were given the same definition by all four glossaries. In the biomedical domain, UMLS (the Unified Medical Language System) has been under development since 1986 to bridge different biomedical thesauri. Natural language processing in the biodiversity domain needs a comparable ontological infrastructure. Without consolidating ambiguous definitions, the ability for different annotated collections to communicate with each other is lost, defeating the purpose of semantic annotation.
2. Beneficial characteristics
[Table: Word repetition in morphological descriptions. Only the column label "Treatise Part H" survives; the values are not recoverable.]
An unsupervised learning method
With an understanding of the challenging and beneficial characteristics of morphological descriptions, Cui explored an unsupervised learning method that discovers organ names and character states directly from descriptions, without being limited by any templates. The algorithm takes advantage of the deviated syntax and works without any lexicons, extraction templates, or training examples. Therefore, the algorithm is expected to work on descriptions of any taxon group written in the deviated syntax. This removes or significantly reduces the manual labor required to craft parsers, templates, or training examples on a collection-by-collection basis. Unlike the supervised learning approach, the unsupervised algorithm identifies organ names and character states mentioned in morphological descriptions by bootstrapping between the subjects (which are typically organ names) and the subsequent words (called "boundary words," over 90% of which are character states) in the clauses. To illustrate, suppose the algorithm is primed with the knowledge that "petals" is an organ and can be a subject. When it comes across the clause "petals absent," it infers that "absent" is a state. Knowing that, it further infers that "subtending bracts" in "subtending bracts absent" is an organ. By now, the algorithm has learned two new terms: "absent" is a state and "subtending bracts" is an organ. The algorithm continues searching through the descriptions, applying what it has already learned to discover the unknowns, until there is no new discovery to be made. The algorithm takes advantage of the deviated yet simple syntax and the repetitive usage of terms in morphological descriptions.
The assumption that every clause starts with an organ name followed by a state does not always hold. However, because the same organ names and states are used repeatedly in different combinations across descriptions, the chance of each being discovered has been shown to be very good.
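The bootstrapping idea described above can be illustrated with a minimal Python sketch. This is a simplified toy under stated assumptions, not the published algorithm: the function name `bootstrap`, the whitespace tokenization, and the rule of stopping at the first recognized subject or boundary word in each clause are all assumptions made for illustration.

```python
def bootstrap(clauses, seed_organs):
    """Alternately infer states from known organs and organs from known states."""
    organs, states = set(seed_organs), set()
    tokenized = [c.lower().rstrip(";. ").split() for c in clauses]
    changed = True
    while changed:  # repeat until a full pass makes no new discovery
        changed = False
        for words in tokenized:
            for i in range(1, len(words)):
                subject, boundary = " ".join(words[:i]), words[i]
                if subject in organs:
                    # Known organ as subject: the next word is taken as a state.
                    if boundary not in states:
                        states.add(boundary)
                        changed = True
                    break
                if boundary in states:
                    # Known state as boundary word: preceding words are an organ.
                    organs.add(subject)
                    changed = True
                    break
    return organs, states


if __name__ == "__main__":
    clauses = [
        "petals absent;",
        "subtending bracts absent;",
        "subtending bracts ovate;",
        "sepals ovate.",
    ]
    organs, states = bootstrap(clauses, seed_organs={"petals"})
    print(sorted(organs))  # ['petals', 'sepals', 'subtending bracts']
    print(sorted(states))  # ['absent', 'ovate']
```

Starting from the single seed "petals," the sketch learns "absent," then "subtending bracts," then "ovate," then "sepals," mirroring the chain of inferences in the example above. Each term is learned only because it recurs in a different combination elsewhere in the text.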
The identification of organ names is sufficient to perform clause-level annotation at an accuracy of 92% to 95%. Compared with the supervised algorithm reported in Cui on the same dataset (i.e., 633 descriptions from FNA) for clause-level annotation, the unsupervised algorithm achieved better performance, ran five times faster, and eliminated the need for training examples. Notably, the unsupervised algorithm marked up all clauses left out by the supervised learning algorithm due to the inadequate template problem. Organ names and character states learned by the unsupervised algorithm were significantly cleaner and more useful for marking up new descriptions or constructing domain lexicons.
The most recent evaluation, on several hundred to several thousand descriptions from volume 19 (Asteraceae) of FNA and Part H (Brachiopods) of the Treatises, found that 90% of the organ names learned by the algorithm were correct (precision) and that these accounted for 80% to 90% of all organ names mentioned in the descriptions (recall). Between 92% and 98% of learned character states were correct, and these accounted for 50% to 75% of all character states mentioned in the descriptions. A plant description correctly annotated by the algorithm is shown in Figure 1.
The unsupervised algorithm has two notable limitations. (1) While the algorithm learned organ names and character states with very good precision, the recall of character states was only in the range of 50% to 75%. There is hope to further improve the recall by learning from parallel text. Wood et al. showed that the use of parallel text improved the recall threefold. (2) To fully mark up at the character level, the identified character states must be connected to their characters, and the characters to organs. However, characters are rarely explicitly mentioned in the descriptions. For example, in "stems prostrate to erect," the character to which "prostrate" and "erect" belong is only implied. As discussed earlier, "erect" may be a habit, an orientation, a position, or a placement, depending on which source one consults and when. The confusion over the implied characters is a problem for supervised and unsupervised approaches alike, but in supervised learning, a designation is often arbitrarily made (e.g., making "erect" a habit) and fixed in the extraction templates and training examples, so the issue seems to be resolved, until the annotation needs to be merged with another collection where "erect" is an orientation. Without templates and training examples, the unsupervised algorithm could logically group character states of the same character together by their co-occurrence patterns (e.g., "prostrate" and "erect" often appear together, so they are in the same group), and wait for an authority to determine what they really are. It is much easier for a domain scientist to label the group "dark brown," "chestnut-colored," and "greenish-blue" as color than to annotate hundreds of training descriptions. The co-occurrence patterns may provide some useful clues for an expert or a group of experts to determine a category for the more troublesome terms such as "erect."
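Precision and recall, as used in the evaluation above, can be computed for a set of learned terms against a hand-checked reference set as in the following Python sketch. The term lists are made-up toy data, not results from the evaluation, and the function name `precision_recall` is an assumption for illustration.

```python
def precision_recall(learned, gold):
    """Return (precision, recall) of learned terms against a reference set."""
    learned, gold = set(learned), set(gold)
    correct = learned & gold
    # Precision: share of learned terms that are correct.
    precision = len(correct) / len(learned) if learned else 0.0
    # Recall: share of reference terms that were learned.
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    learned_organs = {"stem", "leaf", "petal", "erect"}         # toy output
    gold_organs = {"stem", "leaf", "petal", "sepal", "bract"}   # toy reference
    print(precision_recall(learned_organs, gold_organs))  # (0.75, 0.6)
```

In the toy data, three of the four learned terms are true organ names (precision 0.75), and they cover three of the five organ names in the reference set (recall 0.6).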
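One way to picture grouping character states by co-occurrence is the Python sketch below. It assumes, purely for illustration, that two states belong to the same group when they are joined by "to" or "or" in a clause (as in "prostrate to erect"); the text above states only that such states often appear together, so this connector rule, the function name `group_states`, and the restriction to single-word states are all assumptions.

```python
from collections import defaultdict


def group_states(clauses, states):
    """Group states that are linked by 'to' or 'or' in at least one clause."""
    parent = {s: s for s in states}  # union-find forest over the known states

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for clause in clauses:
        words = clause.lower().rstrip(";. ").split()
        for a, link, b in zip(words, words[1:], words[2:]):
            if link in {"to", "or"} and a in parent and b in parent:
                parent[find(a)] = find(b)  # merge the two groups

    groups = defaultdict(set)
    for s in states:
        groups[find(s)].add(s)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    clauses = [
        "stems prostrate to erect;",
        "stems erect or ascending;",
        "bracts green to brown.",
    ]
    states = {"prostrate", "erect", "ascending", "green", "brown"}
    print(group_states(clauses, states))
    # [['ascending', 'erect', 'prostrate'], ['brown', 'green']]
```

The groups come out unlabeled, which matches the division of labor proposed above: the algorithm collects the candidates, and a domain expert decides whether a group is, say, orientation or color.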
An overall strategy for semantic annotation of biodiversity documents
Organ names and character states learned by the unsupervised technique can be used to enhance or build domain lexicons. Knowing organ names are nouns and character states are adjectives, most of the parsing errors shown in Figure 6 could be resolved; for example, knowing "flagellomere" is a noun (NN) would correct one of the parsing errors. These concepts can also be used to extend the coverage of the extraction templates used by supervised learning techniques, addressing the problem of inadequate templates. In addition, the cheap yet rather effective unsupervised algorithm may be used to mark up descriptions to obtain "weak" training examples, which can then be refined, if necessary, for supervised learning techniques.
The organ names and groups of character states discovered from literature via unsupervised learning may be selected by domain experts to be included in domain ontologies, which in return ensures the annotation produced by any annotation systems is interoperable. Domain knowledge of human experts is best used here, rather than preparing training examples collection by collection.
The learned concepts may be used for recognizing and extracting morphological description paragraphs from their parent documents--a necessary first step before morphological information can be annotated further. A description paragraph can be distinguished from, say, a distribution section simply by the density of the words representing organ names and character states. Sophisticated supervised text classification algorithms have been used for this purpose, but they require training examples to run. We have used the concepts learned without supervision from a portion of FNA to identify description paragraphs in other volumes with 100% accuracy and almost no effort.
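The density test described above can be sketched in a few lines of Python. The vocabulary, the function names, and the threshold value of 0.3 are hypothetical choices for illustration; the text does not report the threshold actually used.

```python
import re


def term_density(paragraph, vocabulary):
    """Fraction of a paragraph's words that are known organ names or states."""
    words = re.findall(r"[a-z]+", paragraph.lower())
    if not words:
        return 0.0
    return sum(word in vocabulary for word in words) / len(words)


def is_description(paragraph, vocabulary, threshold=0.3):
    """Flag a paragraph as a morphological description (threshold is assumed)."""
    return term_density(paragraph, vocabulary) >= threshold


if __name__ == "__main__":
    # Toy vocabulary standing in for terms learned by the unsupervised method.
    vocabulary = {"stems", "leaves", "petals", "erect", "ovate", "absent"}
    description = "Stems erect; leaves ovate; petals absent."
    distribution = "Flowering in spring; found in moist woods of Texas and Oklahoma."
    print(is_description(description, vocabulary))   # True
    print(is_description(distribution, vocabulary))  # False
```

Because the vocabulary is learned rather than hand-built, this kind of filter needs no training examples, which is the contrast with supervised text classification drawn above.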
All marked up descriptions should ideally be deposited in a common repository as they can be training examples or otherwise helpful to either supervised or unsupervised learning techniques.
Lastly, many systematic biologists are not aware that the spreadsheets they use to draft descriptions could be easily used as training examples for supervised learning. Spreadsheets are another source (besides the literature) of distilled domain knowledge, based on which the meaning of a concept may be verified and determined.
The strategy proposed above is based on the characteristics of several biodiversity document collections we have observed. With millions of pages of biosystematic literature digitized by the Biodiversity Heritage Library and others, systematic biologists, information scientists, and others must work together to put the text into a computer-understandable and interoperable format quickly, so that the knowledge comes alive again. Language processing infrastructure such as domain lexicons and ontologies should be built and shared not to benefit any particular project but to stay useful for all. As the number of active taxonomists is currently declining, their time should be spent on the most challenging part of the puzzle, namely defining the meaning of domain concepts, so that domain ontologies become useful and remain so for a long time to come. A strategy that would lead us faster to the ultimate goal of a "biodiversity information commons" on the Semantic Web involves computer scientists using and developing low-cost unsupervised learning methods for annotating the literature directly or for feeding more expensive supervised learning approaches. But more important than anything else, domain scientists are needed to share their character matrices as training data and to verify learning results produced by the algorithms (including lexicons, ontologies, and annotated documents). Resources should be directed to developing reusable knowledge entities, including benchmarks for evaluating system performance, in standard formats for a cumulative growth of computer-usable knowledge.
We have experimented with a number of semantic annotation techniques and learned the characteristics of morphological descriptions over time. These experiences have led us to the overall strategy proposed above. With the support of an NSF grant and a group of enthusiastic domain scientists, we are implementing the strategy, including developing the unsupervised learning algorithm and using it to help lexicon and ontology constructions. All will be further developed and tested on different taxon groups for character-level annotation and released for public download by 2011. Post-2011 we plan to make use of the lexicons and ontologies produced to annotate biodiversity-related, true natural language text. Along the way we hope to develop standard benchmark datasets for algorithm evaluation in the biodiversity domain.
This research is supported in part by NSF grant EF-0849982 and a grant from the Flora of North America Project.
1. Flora of North America Editorial Committee (Eds): Flora of North America. [http://www.fna.org/]
2. Tang X, Heidorn PB: Using automatically extracted information in species page retrieval. Proceedings of TDWG 2007. [http://www.tdwg.org/proceedings/article/view/195]
3. Cui H, Macklin J, Yu C: Application of semantic annotation for quality insurance in biosystematics publishing. Proceedings of the Annual Meeting of American Society of Information Science and Technology 2009 (in CD) 2009.
4. Taylor A: Extracting knowledge from biological descriptions. Proceedings of 2nd International Conference on Building and Sharing Very Large-Scale Knowledge Bases 1995, 114–119.
5. Abascal R, Sanchenz J: X-tract: Structure extraction from botanical textual descriptions. Proceeding of the String Processing & Information Retrieval Symposium 1999, 2–7.
6. Kirkup D, Malcolm P, Christian G, Paton A: Towards a Digital African Flora. Taxon 2005, 54(2):457–466. 10.2307/25065373
7. Sautter G, Agosti D, Bohm K: Semi-automated xml markup of biosystematics legacy literature with the GoldenGATE editor. In Proceedings of Pacific Symposium on Biocomputing, January 3–7, 2007; Wailea, Maui, Hawaii. Edited by: Altman RB, Murray T, Klein TE, Dunker AK, Hunter L. 2007, 391–402.
8. Curry G, Connor R: Automated extraction of data from text using an xml parser: an earth science example using fossil descriptions. Geosphere 2008, 4(1):159–169. 10.1130/GES00140.1
9. Lydon S, Wood M, Huxley R, Sutton D: Data patterns in multiple botanical descriptions: implications for automatic processing of legacy data. Systematics and Biodiversity 2003, 1(2):151–157. 10.1017/S1477200003001129
10. Soderland S: Learning information extraction rules for semi-structured and free text. Machine Learning 1999, 34(1–3):233–272. 10.1023/A:1007562322031
11. Cui H: Converting taxonomic descriptions to new digital formats. Biodiversity Informatics 2008, 5: 20–40.
12. Wu ZY, Raven PH, (Eds): Flora of China. Beijing: Science Press & St. Louis: Missouri Botanical Garden Press; 1994.
13. Wu ZY, Hong DY, Raven PH, (Eds): Flora of China. Beijing: Science Press & St. Louis: Missouri Botanical Garden Press; 2001.
14. Diggs G, Lipscomb B, O'Kennon R: Shinners & Mahler's Illustrated Flora of North Central Texas. Fort Worth, Texas: Center for Environmental Studies and Department of Biology, Austin College, Sherman, Texas, and Botanical Research Institute of Texas (BRIT); 1999.
15. Greenstone Digital Library Software. [http://research.sbs.arizona.edu/gs/cgi-bin/library]
16. Wood MM, Lydon SJ, Tablan V, Maynard D, Cunningham H: Using parallel texts to improve recall in IE. Recent advances in natural language processing: In Selected Papers from RANL: 10–12 Sept, 2003; Samokov, Bulgaria 2003, 70–77.
17. Diederich J, Fortuner R, Milton J: Computer-assisted data extraction from the taxonomical literature. [http://math.ucdavis.edu/~milton/genisys.html]
18. Biodiversity Heritage Library. [http://www.biodiversitylibrary.org/]
19. Cui H, Heidorn PB: The reusability of induced knowledge for the automatic semantic markup of taxonomic descriptions. Journal of the American Society for Information Science and Technology 2007, 58(1):133–149. 10.1002/asi.20463
20. Brown Corpus. [http://khnt.aksis.uib.no/icame/manuals/brown/]
21. Kiger RW, Porter DM: Categorical Glossary for the Flora of North America Project. [http://huntbot.andrew.cmu.edu/HIBD/Departments/DB-INTRO/IntroFNA.shtml]
22. Plant Characteristics. [http://herbaria.plants.ox.ac.uk/vfh/image/?glossary=show]
23. PATO - Phenotypic quality ontology. [http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=UO]
24. Cui H: Competency evaluation of plant character ontologies against domain literature. Journal of American Society of Information Science and Technology 2010, 61: 1144–1165.
25. The Stanford Parser. [http://nlp.stanford.edu/software/lex-parser.shtml]
26. Moore RC, Teichert C, Robison RA, Kaesler RL, Selden PA, (Eds): Treatise on Invertebrate Paleontology. Lawrence, Kansas: University of Kansas and Boulder, Colorado: Geological Society of America.
27. Cui H: Unsupervised Semantic Markup of Literature for Biodiversity Digital Libraries. Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital libraries 2008, 25–28.
28. Riloff E, Jones R: Learning dictionaries for information extraction by multi-level bootstrapping. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, July 18–22, 1999. Orlando, Florida. American Association for Artificial Intelligence; 1999:474–479.
29. Cui H, Boufford D, Selden P: Semantic annotation of biosystematics literature without training examples. Journal of American Society of Information Science and Technology, in press.
30. Cui H, Heidorn P, Zhang H: An approach to automatic classification for information retrieval. In Proceedings of the Second ACM/IEEE-CS Joint Conference on Digital Libraries: 14–18 July 2002; Portland. Edited by: Marchionini G, Hersh W. Association for Computing Machinery; 2002:96–97.
31. The Kepler Project. [http://kepler-project.org/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.