NetiNeti: discovery of scientific names from text using machine learning methods
© Akella et al.; licensee BioMed Central Ltd. 2012
Received: 15 October 2010
Accepted: 6 August 2012
Published: 22 August 2012
A scientific name for an organism can be associated with almost all biological data. Name identification is an important step in many text mining tasks aiming to extract useful information from biological, biomedical and biodiversity text sources. A scientific name acts as an important metadata element to link biological information.
We present NetiNeti (Name Extraction from Textual Information-Name Extraction for Taxonomic Indexing), a machine learning based approach for recognition of scientific names including the discovery of new species names from text that will also handle misspellings, OCR errors and other variations in names. The system generates candidate names using rules for scientific names and applies probabilistic machine learning methods to classify names based on structural features of candidate names and features derived from their contexts. NetiNeti can also disambiguate scientific names from other names using the contextual information. We evaluated NetiNeti on legacy biodiversity texts and biomedical literature (MEDLINE). NetiNeti performs better (precision = 98.9% and recall = 70.5%) compared to a popular dictionary based approach (precision = 97.5% and recall = 54.3%) on a 600-page biodiversity book that was manually marked by an annotator. On a small set of PubMed Central’s full text articles annotated with scientific names, the precision and recall values are 98.5% and 96.2% respectively. NetiNeti found more than 190,000 unique binomial and trinomial names in more than 1,880,000 PubMed records when used on the full MEDLINE database. NetiNeti also successfully identifies almost all of the new species names mentioned within web pages.
We present NetiNeti, a machine learning based approach for identification and discovery of scientific names. The system implementing the approach can be accessed at http://namefinding.ubio.org.
There is a vast and ever growing amount of literature in biology, ecology, biomedicine, biodiversity, genomics and proteomics. The U.S. National Library of Medicine’s MEDLINE database is one such source, with more than 18 million abstracts of journal articles in the life sciences with a focus on biomedicine. Major efforts to digitize legacy literature undertaken by consortia like the Biodiversity Heritage Library (BHL) generate vast amounts of text data from the Optical Character Recognition (OCR) of scanned literature. Extraction of knowledge from sources like MEDLINE can significantly speed up biomedical research by providing access to relevant information about diseases, genes, gene-protein and protein-protein interactions, model organisms and drugs. While gene/protein identification and binary interactions have been the focus of biomedical text mining, more ambitious tasks like identifying complex nested structures are also being pursued.
Identification of species names and the normalization task of mapping them to identifiers in a database are considered essential sub-tasks for many text mining projects [4, 5], such as recognizing gene names [6–8] or extracting organism-specific information like life history, geographic distribution and predator–prey relationships from biodiversity and biomedical literature. A scientific name is a genus name, a species level name with genus followed by species, or a name below the species level with genus, species and subspecies information. It can also be a higher order taxonomic name like family, order, etc. A scientific name is one of the named entities that can be connected with other entities like gene names, protein names, geographic locations, diseases, common names of organisms and names of people who first described the species. Recognition of named entities is frequently a first step in performing more complex information extraction tasks like finding relations between the named entities or question answering [9, 10]. The name of an organism is one of the few identifying elements associated with almost all biological data. A scientific name extraction system is therefore useful for gathering all contexts, in the form of sentences or paragraphs, associated with organism names. These sentences and paragraphs can help enrich the existing content and add new content for projects like the Encyclopedia of Life (EOL), which aims to create a webpage for every single species on Earth. Natural language processing and machine learning methods can be applied to extract fine-grained, atomic information that can be used to populate biological databases and repositories. The organism name serves as an important metadata element for linking information from various biological sources [13–16], so a species name identification system is an essential tool in information integration.
Most of the approaches in the literature addressing the problem of name finding from text sources rely primarily on dictionaries with a list of scientific and/or common names [4, 14, 17, 18]. TaxonGrab is a dictionary-based approach that uses a dictionary generated by combining dictionaries of English words and biomedical terms instead of a list of scientific names. Words that do not appear in this dictionary (inverse lexicon) and that follow simple rules for capitalization, abbreviations, variants and subspecies mentions used in scientific names are considered organism names. Approaches that rely primarily on this kind of inverse lexicon tend to have low precision, as they can gather many false positives from misspelled English words, OCR errors and non-English words that pass through the rule filters. The precision of the system can also vary significantly from one text source to another depending on the number of words covered by the inverse lexicon. Hence such a system is also likely to perform very poorly on non-English texts.
TaxonFinder is designed to find scientific names in text with the help of separate dictionaries for species and genus names. Though the approach is likely to have fewer false positives, the number of false negatives (the number of correct names missed) can be high, as it cannot find anything that is not a genus and species combination from the dictionaries it uses. Such an approach cannot find misspelled names, names with OCR errors, new species names and other names not present in the dictionary. Such a system can also have false positives due to the presence of incorrect names in the dictionary, and names that are spelled the same as some common English words and geo-location names (e.g. major, Atlanta).
The approach “Linnaeus” uses dictionaries for scientific and common names to construct a DFA (Deterministic Finite Automaton) to match species names. The system also tries to resolve acronyms for organisms (e.g. HIV, CMV) using the frequencies of the most commonly used acronyms in MEDLINE, calculated using Acromine. Linnaeus only focuses on finding species names and currently does not deal with genera or other higher-order taxonomic units. Being inherently a dictionary based approach, Linnaeus also has the issues discussed above for approaches like TaxonFinder. There are also other dictionary-based approaches that identify species names based on the NCBI taxonomy [21, 22]. FAT (Find All Taxon names) is another tool that uses a combination of rules, dictionaries of scientific names and non-names along with input from users to find scientific names. Wang et al. [8, 23, 24] developed approaches to tag and disambiguate genes, proteins and protein-protein interactions with species names from the NCBI taxonomy, Uniprot and manually created dictionaries using a rule based approach and/or a machine learning based classifier. Their main objective was to disambiguate gene/protein or protein-protein mentions in text using species tags.
Here we focus on recognition/discovery of scientific names of organisms from various text sources. The problem of discovery of binomial and trinomial scientific names along with genera and higher taxonomic units can be quite complex. For example, biodiversity literature and legacy text sources like BHL (Biodiversity Heritage Library) contain many names with OCR errors, alternative names and misclassified names. Thousands of new species are discovered every year and many are reclassified. Some names are spelled the same as geo-locations or people names and therefore disambiguation of names is required. We have developed approaches and built tools that address all of the above.
NetiNeti is a solution for scientific name recognition/discovery. This approach enables finding scientific names in literature from various domains like biomedicine and biodiversity. It can discover new scientific names and also find names with OCR errors and variations. The system is based on probabilistic machine learning methods, where a given string has a certain probability of being a scientific name or not being a scientific name depending on the name string itself and the context in which it appears. NetiNeti builds a machine learning classifier from both the structural features of a string and its contextual features. In the process of classifying a string, the approach can differentiate common words, like names of places or people, from scientific names based on the context in which a name appears. For example, Atlanta is a scientific name in the sentence, “Atlanta is a genus of pelagic marine gastropod molluscs”. However, in the sentence, “The city Atlanta is in the state of Georgia”, Atlanta is a geographic location and not a genus name. NetiNeti correctly recognizes the word Atlanta as a scientific name in the first context and does not recognize it as a scientific name in the second context. Simple rules for capitalization and abbreviations in species names are applied as a pre-filtering step to generate candidate names. Candidates with common English words are also removed in the pre-filtering process. The candidate names along with their contexts are then classified using a supervised machine learning classifier. While the system can disambiguate and discover which scientific names of organisms are mentioned in a document, the approach is not about discovering documents that are about specific organisms based on their presence in the document.
We evaluated NetiNeti on legacy biodiversity texts (BHL books) and biomedical literature (MEDLINE). We compared results of NetiNeti and a dictionary based scientific name finder with the results of manual annotation of a BHL book. A comparison of some of the probabilistic machine learning algorithms on our annotated dataset for scientific name finding is presented. We also present the results of running NetiNeti on other biological text sources.
The input text is first tokenized using a tokenization scheme that breaks the stream of characters in natural language text into distinct meaningful units called tokens. We followed the conventions used by the Penn Treebank project to tokenize text. Word trigrams, which are groups of three tokens along the token sequence, are then generated from the tokenized text, and each trigram is passed through a simple rule filter which checks that the tokens in the trigram have the right capitalization, abbreviations, etc. and that the trigram contains no common English words. Each trigram that passes through the rule filter is then classified by a machine learning classifier as “scientific-name” or “not-a-scientific-name” using the structural and contextual features of the trigram. A trigram that is classified as a scientific name corresponds to a trinomial name, which is a name below the species level with genus, species and usually a subspecies. If a trigram fails to pass through the rule filter, the first two tokens (word bigram) of the trigram are tested to see if they can become a candidate for a binomial name, with genus followed by a species mention. The classifier then classifies such candidate bigrams. Similarly, the first token of a failed bigram is analysed to see if it can become a candidate for a uninominal name (genus or higher order taxonomic unit), which gets classified accordingly if it is deemed a candidate. NetiNeti also resolves abbreviated species names by noting that an abbreviation can be used for a species after a mention of its genus, that an abbreviation can follow a mention of a full name (genus-species combination), or that an abbreviated name for a species can be used after a mention of another species name from the same genus.
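The trigram, bigram, unigram fallback described above can be sketched as follows. This is a minimal illustration rather than the NetiNeti code: the regular expressions, the tiny `COMMON_WORDS` stop list and the function names `passes_rules` and `candidates` are all assumptions made for the example, and the input is assumed to be already tokenized.

```python
import re

# Hypothetical stop list; the actual list of common English words is not given in the text.
COMMON_WORDS = {"the", "is", "a", "of", "in", "and", "was", "this", "that", "with"}

# Assumed capitalization rules: a capitalized genus (or an abbreviation such as "H.")
# followed by lowercase epithets.
CAPITALIZED = re.compile(r"^[A-Z][a-z]+$|^[A-Z]\.$")
LOWERCASE = re.compile(r"^[a-z]+(-[a-z]+)?$")


def passes_rules(tokens):
    """Return True if the tokens look like a uninomial, binomial or trinomial."""
    if not tokens or not CAPITALIZED.match(tokens[0]):
        return False
    if len(tokens) == 1 and tokens[0].endswith("."):
        return False  # a lone abbreviation is not a candidate on its own
    if any(t.lower() in COMMON_WORDS for t in tokens):
        return False
    return all(LOWERCASE.match(t) for t in tokens[1:])


def candidates(tokens):
    """Yield (start_index, candidate_tokens), trying trigram, then bigram, then unigram."""
    for i in range(len(tokens)):
        for n in (3, 2, 1):
            window = tokens[i:i + n]
            if len(window) == n and passes_rules(window):
                yield i, window
                break  # keep the longest candidate starting at this position
```

Each candidate produced this way would still have to be accepted or rejected by the classifier; the rule filter only narrows down what the classifier sees.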
We are primarily interested in the conditional probability $P(c_i \mid s_j)$ of a class label $c_i$, given an input string and its contexts $s_j$, as in Eq. 1. The ‘yes’ and ‘no’ labels correspond to whether a string is a scientific name or not. Once we get these conditional probabilities, we simply choose the label with the highest probability for a given string. The Naïve Bayes classifier [27–29], as seen in Eq. 1, actually models the joint probability of a class $c$ and a string $s$, and makes the assumption that all the features $f_k$ for the string and its contexts are independent given the class label:

$$P(c_i \mid s_j) = \frac{P(c_i)\,P(s_j \mid c_i)}{P(s_j)}, \qquad P(s_j \mid c_i) = \prod_k P(f_k \mid c_i) \tag{1}$$

This independence assumption is strong, but it makes it easy to estimate the probability $P(s_j \mid c_i)$ of a string $s_j$ given the class label $c_i$ from a training set of labelled examples. Even with this independence assumption, the Naïve Bayes classifier performs surprisingly well in many document classification tasks [27, 29]. $P(f_k \mid c_i)$ can be estimated from the number of training examples having the feature value $f_k$, and the number of examples with class label $c_i$ and also having the feature value $f_k$. We can then get the class label for a string (along with its contexts) from Eq. 2, with probabilities taken in the log scale:

$$\hat{c} = \arg\max_{c_i} \Big[ \log P(c_i) + \sum_k \log P(f_k \mid c_i) \Big] \tag{2}$$
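As a rough illustration of choosing a label with log-scale probabilities under the independence assumption, the sketch below trains count tables from labelled feature dictionaries and scores each label. It is not the NLTK implementation used in the paper; the function names, the feature names in the usage note and the add-one smoothing are assumptions made so that unseen feature values do not produce a zero probability.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (feature_dict, label) pairs. Returns simple count tables."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)  # label -> Counter over (name, value)
    values = defaultdict(set)              # feature name -> values seen in training
    for feats, label in examples:
        label_counts[label] += 1
        for name, value in feats.items():
            feature_counts[label][(name, value)] += 1
            values[name].add(value)
    return label_counts, feature_counts, values


def classify(feats, model):
    """Return (best_label, log_scores) using summed log probabilities."""
    label_counts, feature_counts, values = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior of the class label
        for name, value in feats.items():
            # Add-one smoothing is an assumption of this sketch.
            num = feature_counts[label][(name, value)] + 1
            den = n + len(values.get(name, ()))
            score += math.log(num / den)
        scores[label] = score
    return max(scores, key=scores.get), scores
```

With a toy training set such as `[({"suffix": "us", "prev": "genus"}, "yes"), ({"suffix": "ta", "prev": "city"}, "no")]`, calling `classify({"suffix": "us", "prev": "genus"}, model)` picks the label whose summed log score is highest.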
The Maximum Entropy classifier models the conditional probability of a class label $c_i$ given a string and its contexts $s_j$ directly:

$$P(c_i \mid s_j) = \frac{1}{Z(s_j)} \exp\Big( \sum_k \lambda_k f_k(c_i, s_j) \Big)$$

where each $f_k(c_i, s_j)$ is a binary valued feature function defined on the class label and the string context, the $\lambda_k$ are the weights to be learned from the training set for the feature functions, and $Z(s_j)$ is a normalizing factor that ensures that $\sum_i P(c_i \mid s_j) = 1$. The parameters are estimated via hill climbing approaches like Improved Iterative Scaling (IIS) or Generalized Iterative Scaling (GIS). Limited-Memory Variable Metric optimization methods like L-BFGS have been found to be effective for Maximum Entropy parameter estimation. In our scientific name recognition task, we have applied and compared the IIS, GIS and L-BFGS methods for parameter estimation on a corpus that was manually annotated with scientific names. For both the Naïve Bayes and the Maximum Entropy classifiers, we used the Python implementations in the NLTK package. The MEGAM optimization package was used for L-BFGS optimization.
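To make the role of the weights and the normalizing factor concrete, here is a small sketch that turns weighted binary features into class probabilities. The weights are hand-picked for illustration rather than learned with IIS, GIS or L-BFGS, and the feature names and the function `maxent_probabilities` are hypothetical.

```python
import math

LABELS = ("yes", "no")


def maxent_probabilities(active_features, weights):
    """Compute P(label | string, context) from weighted binary features.

    active_features: names of the binary features that fire for this candidate.
    weights: dict mapping (feature_name, label) to a weight; missing pairs count as 0.
    """
    scores = {
        label: sum(weights.get((f, label), 0.0) for f in active_features)
        for label in LABELS
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in scores.items()}
    z = sum(exp_scores.values())  # normalizing factor
    return {label: v / z for label, v in exp_scores.items()}


# Hypothetical weights, chosen by hand for the example.
example_weights = {("ends_in_us", "yes"): 1.5, ("prev_word_city", "no"): 2.0}
probs = maxent_probabilities(["ends_in_us"], example_weights)
```

Dividing by the normalizing factor is what guarantees that the probabilities over the two labels sum to one, whatever the weights are.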
An initial set of about 5,000 names was used as a positive example set. Candidate strings from the unigrams, bigrams and trigrams of a tokenized BHL book, which does not contain any scientific names, were used as an initial negative example set. An initial maximum entropy classifier was trained with the initial training set using only the structural features of strings. A set of MEDLINE abstracts, a small portion of content from EOL and biodiversity texts from BHL were segmented into sentences using the sentence tokenizer in NLTK; pre-filtering and candidate generation steps were performed for each sentence, and the initial classifier was used to get scientific names that were identified with high confidence. The scientific names along with the sentences in which they occur together form the positive example set. Features were derived from the scientific names and a neighborhood of word contexts appearing around the scientific names in the sentences. We tokenized a geography book from the Internet Archive, and the strings derived from word unigrams, bigrams, and trigrams in the tokenized text of the book form the negative example set. About 10,000 positive examples with contextual information and another 10,000 examples from scientific names without contextual information were used as the positive example set. Abbreviated names from these examples were also added to the positive example set. A total of about 40,000 positive examples together with another set of about 43,000 negative examples were used to generate a training set of 83,000 examples for the two class labels. Features used include the last three, last two and the last characters along with the first and second characters of the unigram, bigram, and trigram candidates. Binary features like whether the last, second last, and third last characters are present in different partitions of the set {’a’, ’e’, ’i’, ’o’, ’u’, ’s’, ’m’} were also used.
Presence or absence of a particular word in unigram, bigram, and the trigram candidates in a dictionary of genus and species combinations were also part of the binary features. When a word token is part of the dictionary of names it contributes to the conditional probability of the candidate name given the structural and contextual features. Numerical features like the number of vowels in various parts of the candidate names were also used. For contextual features, words appearing in the neighborhood of candidate names and their parts-of-speech tags were used.
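A simplified sketch of how such structural and contextual features might be assembled into a feature dictionary is given below. The feature names, the single ending set (rather than several partitions of it) and the padding symbols are assumptions; dictionary lookups and part-of-speech tags are left out to keep the example self-contained. The `cspan` parameter mirrors the context span used in the experiments.

```python
VOWELS = set("aeiou")
ENDING_SET = set("aeiousm")  # the character set mentioned in the text, used here as one group


def structural_features(candidate):
    """candidate: list of one to three tokens (unigram, bigram or trigram)."""
    feats = {}
    for i, tok in enumerate(candidate):
        w = tok.lower()
        feats[f"last3_{i}"] = w[-3:]
        feats[f"last2_{i}"] = w[-2:]
        feats[f"last1_{i}"] = w[-1:]
        feats[f"first_{i}"] = w[:1]
        feats[f"second_{i}"] = w[1:2]
        feats[f"last_in_set_{i}"] = w[-1:] in ENDING_SET
        feats[f"vowels_{i}"] = sum(ch in VOWELS for ch in w)
    return feats


def contextual_features(tokens, start, length, cspan=2):
    """Words within cspan positions on either side of the candidate."""
    feats = {}
    for k in range(1, cspan + 1):
        left = start - k
        right = start + length - 1 + k
        feats[f"left_{k}"] = tokens[left].lower() if left >= 0 else "<s>"
        feats[f"right_{k}"] = tokens[right].lower() if right < len(tokens) else "</s>"
    return feats
```

The two dictionaries can be merged and passed to a classifier that accepts feature dictionaries, which is the input format NLTK classifiers expect.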
NetiNeti focuses on discovering/identifying scientific names of organisms including names with spelling and OCR errors from text sources across domains like biodiversity and biomedicine. We present the results of running NetiNeti on three different text sources.
Precision and recall values for NetiNeti, TaxonFinder and FAT on the American Seashell book
Results of running NetiNeti with Naïve Bayes and MaxEnt (GIS) on MEDLINE (unique binomial and trinomial names)
Figure 1C illustrates the effect of increasing the number of contextual features and the number of positive examples in the training set. For example, the blue stars in Figure 1C correspond to using five contextual features on either side of the candidate name with increasing positive example size during training. This is shown more clearly in Figure 1D, where we used five contextual features (cspan = 5) on either side of the candidate name for each classifier, with the size of the positive example set increasing from 3,000 to 19,000 in increments of 2,000 for training. Figure 1D shows that increasing the positive example set improved the precision of the corresponding classifier, at the cost of a slightly lower recall.
Precision and recall values for Naïve Bayes, Maximum Entropy (IIS, GIS, L-BFGS) and decision tree learning algorithms on the American Seashells book
When calculating the precision and recall reported in Figure 2, we have taken into account only the true false positives. The recall for TaxonFinder is significantly lower than that of NetiNeti, while the precisions are comparable. A dictionary-based approach like TaxonFinder is less likely to produce many false positives, as it only retrieves what is already present in a known set of names in the dictionary, and so it can have higher precision. Its recall, however, can be very low, as the results summarised in Figure 2 show: the number of false negatives (the number of correct names missed) can be high because it cannot find anything that is not a genus and species combination from the dictionaries used. Such an approach also cannot handle misspelled names, names with OCR errors, new species names, or other names not present in the dictionary. NetiNeti, on the other hand, handles these well because it is a name discovery tool. A comparison of NetiNeti, TaxonFinder and the FAT tool for the BHL book is presented in Table 1. The FAT approach has lower precision and recall values than the NetiNeti and TaxonFinder approaches for this corpus. The names marked up by the FAT tool were compared with the manual markup; 869 of the names identified by FAT did not match the manually marked up set of names. Most of these unmatched names are species epithets with authorship information. We further analyzed a random sample of 100 of these 869 names and examined the genus information interpreted by the tool in the marked up tags. Of the 100 mismatched names, 32 have correctly interpreted genus names and the remaining are all true false positives with incorrect genus tags. Extrapolating this 32% rate to all 869 names, we estimated that 278 of them are correct identifications, and the adjusted precision and recall values for the FAT approach are summarized in Table 1.
For many of the true false positives, the FAT tool tags the species epithet, but does not seem to recognize the genus name immediately preceding the species name.
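The arithmetic behind these figures can be checked with a few lines of code. The sketch below reproduces the extrapolation from the 100-name sample to all 869 unmatched FAT names and defines the usual precision, recall and F-score formulas. The F-scores are computed from the precision and recall values quoted in the abstract and are derived here only for illustration; they are not figures reported in the text.

```python
def precision(tp, fp):
    """Fraction of extracted names that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of annotated names that were extracted."""
    return tp / (tp + fn)


def f_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Extrapolating the manual check of a random sample to all unmatched FAT names.
unmatched = 869
sample_size = 100
sample_correct = 32
estimated_correct = round(unmatched * sample_correct / sample_size)  # 278

# F-scores derived from the precision/recall values quoted in the abstract.
netineti_f = f_score(0.989, 0.705)     # roughly 0.82
taxonfinder_f = f_score(0.975, 0.543)  # roughly 0.70
```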
Comparison of NetiNeti and TaxonFinder on web pages with new species descriptions
Desmoxytes purpurosea **
Electrolux addisoni **
Gryposaurus monumentensis **
Malo kingi **
Megaceras briansaltini **
Tecticornia bibenda **
Culexiregiloricus trichiscalida **
Lebbeus clarehanna **
S. ysbryda **
Selenochlamys ysbryda **
Gymnotus omarorum **
Swima bombiviridis **
Nepenthes attenboroughii **
O. vermiculum **
Opisthostoma vermiculum **
Nephila komaci **
We also analysed the results of running NetiNeti on the whole of MEDLINE with the Naïve Bayes and Maximum Entropy (GIS) classifiers, which were the top two algorithms in terms of F-scores in Table 2. The results are summarized in Table 2. NetiNeti with the Naïve Bayes algorithm found 193,596 unique binomial and trinomial names, while the Maximum Entropy algorithm found 188,606 names. That is more than 3 times the number of species found by the dictionary-based Linnaeus system, even though we focus only on scientific names. In the names extracted from MEDLINE, the errors include disease names like Enterohepatitis, terms like Amputatio interilio-abdominalis, which was extracted from the title of a PubMed article in Russian, and chemical names like Aminoanthracene. Some of the errors in biodiversity text include terms like Operculum corneous and words associated with some geographic locations, like Panaina. Biological terms and certain words associated with geographic locations are the kinds of errors common to both corpora. Also, named entities with Latin-like endings can be incorrectly identified as scientific names of organisms by the system, especially when there is little or no contextual information.
The system is highly scalable: we ran name finding on a recent update of MEDLINE, with over 18 million abstracts, in under 9 hours on a 2.8 GHz Intel Core i7 based machine running Mac OS X 10.6 using 6 cores.
As NetiNeti also extracts names with errors and variations, a need to map the names to known identifiers in a master list of names or a database arises. We are working on highly efficient methods based on suffix-trees to do such a mapping.
The software system implementing NetiNeti can be accessed at http://namefinding.ubio.org. Currently a Naïve Bayes classifier is applied by default for name finding. The American Seashell book and a list of PubMed Central ids used for evaluation of NetiNeti can be found at http://ubio.org/netinetifiles
In this article, we presented an approach for recognizing/discovering scientific names along with spelling errors and variations from various text sources in domains like biodiversity and biomedicine. We present NetiNeti as a solution to name discovery that uses machine learning techniques to classify candidate names generated by applying rules and pre-filtering methods on text. NetiNeti is highly scalable and configurable.
An identification and discovery tool like NetiNeti is useful for many tasks: determining the number of scientific names covered in a text, extracting all the sentences or paragraphs associated with scientific names, tagging mentions of genes, proteins or other entities with scientific names, incorporating species names as metadata elements for search, and taxonomic indexing.
This project was funded by the Ellison Medical Foundation and a grant from the National Library of Medicine (R01 LM009725). We thank Anna Shipunova for providing manual annotation and for helpful discussions on scientific names. Anna has more than 10 years of experience in the Department of Biology at Moscow State University, where biological text processing was her major focus. At the MBL she worked with the Encyclopedia of Life biodiversity informatics group before joining the NetiNeti project. We would also like to thank David Patterson and Nathan Wilson for helpful discussions and comments on the manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.