Information extraction from full text scientific articles: Where are the keywords?
© Shah et al 2003
Received: 5 March 2003
Accepted: 29 May 2003
Published: 29 May 2003
To date, many methods for the extraction of biological information from scientific articles are restricted to the abstract of the article. However, full text articles are now widely available in electronic form and offer a much larger source of data. Several questions arise: is the effort of scanning full text articles worthwhile, and how relevant is the information that can be extracted from the different sections of an article?
In this work we address these questions, showing that the keyword content of the different sections of a standard scientific article (abstract, introduction, methods, results, and discussion) is very heterogeneous.
Although the abstract has the best ratio of keywords to total words, other sections of the article may be better sources of biologically relevant data.
Most applications of information extraction from the scientific medical bibliography use the Abstract of the publication (for review see for example [1–3]). In the context of information extraction in molecular biology, it is usually understood that the information to be extracted from an article consists of words denoting biological concepts that synthesize the main points of the article (keywords). The Abstract of a paper is therefore a good target for information extraction, because by definition an abstract synthesizes the content of the article. Moreover, abstracts are available in public databases. However, most journals are now also available in electronic form, and thus full text articles can be used for information extraction.
It is obvious that the full text of an article contains more information than its Abstract. However, full text analysis poses several problems. On the one hand, storing full text articles requires more disk space, and analyzing them demands more computational capacity. On the other hand, an Abstract, being a summary, contains a high frequency of relevant terms (keywords), but this may not be the case for the rest of the article.
Other questions concern the quality of the information carried by the different sections of an article. First, is the information in full text organized enough for keywords to be extracted? Second, different biological concepts (for example, gene and protein names, tissue names, organisms, experimental conditions) may be located in different parts of the article, and a word may have a different meaning depending on the section where it is located (a context-dependent meaning). For example, gene names found in the Methods section may refer mostly to analytical tools rather than to the biological phenomenology described in the article as a whole. In summary, it would be good to quantify and qualify the information in a full text article before embarking on the large-scale extraction of particular items of information.
With this goal in mind, we analyzed the kind of information attached to different parts of an article and tried to quantify how much information can be found in each section. This should help to establish guidelines for researchers attempting to extract particular keywords (words synthesizing the content of the article) from full text articles.
As stated above, the major objective of this work was to compare the information, defined as keyword content, carried by the different sections of a paper, especially the differences between the Abstract and the rest. As the source for our analysis we therefore used a set of full text articles with a regular section structure: a defined Abstract, Introduction, Methods, Results, and Discussion (A, I, M, R, D). Another requirement was a certain homogeneity of style across the articles (for example, a similar length of the Methods section) and, since there is great interest in the data mining field in the detection of gene names, the subject should be related to Genetics. We therefore chose the 104 articles published in Nature Genetics from June 1998 (volume 19, issue 2) to June 2001 (volume 28, issue 2) that comply with the AIMRD structure. Note that other journals, or even the Letters of the very same Nature Genetics, may have a different structure (for example, lacking separate I, M, R, D sections).
To simplify matters, and following our previous work, we focused on the extraction of relevant words (keywords) denoting objects, detected as nouns in natural text by a standard grammatical tagger (TreeTagger, Helmut Schmid, IMS, Stuttgart University, http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/). To derive keywords from a section of an article, we first compute the associations between the words in the section. We take the sentence as the unit of text in which to look for associations; that is, two words are associated in the context of a section if they co-occur repeatedly in sentences within that section (see METHODS).
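The sentence-level co-occurrence counting described above can be sketched as follows. This is a minimal illustration with a toy corpus: the per-sentence noun lists stand in for TreeTagger output, and are not data from the study.

```python
from collections import Counter
from itertools import combinations

# Toy stand-in for tagger output: each sentence reduced to its nouns.
sentences = [
    ["cell", "cycle", "regulation"],
    ["cell", "membrane"],
    ["cell", "cycle", "protein"],
    ["protein", "membrane"],
]

# For every unordered word pair, count the sentences where both occur.
pair_counts = Counter()
for nouns in sentences:
    for w1, w2 in combinations(sorted(set(nouns)), 2):
        pair_counts[(w1, w2)] += 1

print(pair_counts[("cell", "cycle")])  # "cell" and "cycle" co-occur in 2 sentences
```

These raw co-occurrence counts are the input to the association measure defined in METHODS.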
Since words strongly associated with many other words are relevant to the matter dealt with in the article, we use a score (K) that is higher for words with many and strong relations to other words (see METHODS). This measure is used to select words as keywords, in this case related to objects such as proteins, genes, organisms, etc.
In order to evaluate the performance of the keyword detection, we examined how the selected keywords matched the MeSH (Medical Subject Headings, http://www.nlm.nih.gov/mesh/) terms attached to these 104 articles by indexers at the National Library of Medicine (18.6 terms per article on average). Since MeSH terms can be composed of several words (for example, "Learning Disorders"), we selected those composed of a single word (6.80 terms on average). We noted that the most unspecific terms (for example, "animal") were often not present in the text and thus could not be matched by a keyword, as opposed to species names (mouse, mycobacterium, human) or anatomical terms (hippocampus, cerebellum, breast). Of those single-word MeSH terms, 4.91 on average were found in the article (as nouns), and 2.22 were among the set of selected keywords (K >= 0.3). Obviously, a more accurate comparison to MeSH terms would require the detection of bigrams and trigrams (keywords composed of multiple words), but this is outside the scope of our work. The recall in matching the single-word MeSH terms (6.80 on average) went down from 4.91 / 6.80 = 0.72 for the dictionary of 470.6 different nouns present in an article to 2.22 / 6.80 = 0.33 for the 66.6 keywords selected. However, since the list of all nouns found in an article (470.6) is much larger than the number of keywords (66.6), the precision in matching the MeSH terms of an article increased from 4.91 / 470.6 = 0.010 to 2.22 / 66.6 = 0.033.
[Table: Keyword selection per section, at thresholds K >= 0.3, K >= 0.4, and K >= 0.5; the table data are not recoverable from the text.]
To show that the keyword content of the different sections is heterogeneous, we examined which keywords (if any) were selected in all the sections of an article. Our results indicate that, as could be expected, few keywords are present in every section, and those that are tend not to be very relevant. Even for a low threshold of K >= 0.3, there is on average only one such general keyword per article, often a non-informative word such as "gene" or "protein". This indicates that the information is unevenly distributed across the sections of the article; that is, different sections contain different kinds of information.
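The check for keywords shared by every section reduces to a set intersection. A minimal sketch, using hypothetical per-section keyword sets for one article (illustrative only, not data from the study):

```python
# Hypothetical keyword sets per section of one article.
sections = {
    "abstract":     {"gene", "protein", "mutation", "hippocampus"},
    "introduction": {"gene", "protein", "disease", "mouse"},
    "methods":      {"gene", "primer", "buffer"},
    "results":      {"gene", "mutation", "mouse"},
    "discussion":   {"gene", "protein", "mutation"},
}

# Keywords selected in every section of the article.
shared = set.intersection(*sections.values())
print(sorted(shared))  # only the generic word "gene" survives
```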
Average number of keywords (K >= 0.5) shared by two sections for the corpus of 104 articles.
This result indicates that each section contains certain keywords unique to it. In the following we try to characterize the differences in content between sections.
For a deeper analysis of the kind of information present in each section, we classified into seven categories a set of words present in our corpus of 104 articles (among the most frequent nouns). To do so as unambiguously as possible, we selected words that matched MeSH descriptors consisting of that single word and belonging to only one major MeSH category (see METHODS). We added a category not present in MeSH, "Units, Dimensions, & Parts", to account for many terms that are not currently MeSH terms but are of interest to us.
Since the detection of gene and protein names is a very important subject, broadly used for the detection of macromolecular interactions (see for example ), and because, as stated in the introduction, we are concerned about the relevance of matching gene names in different sections of an article, we examined the distribution of gene names across sections.
[Figure: Detection of gene names appearing only in the Methods section. Recoverable panel labels: "Definition of a Yeast strain" (Can1, Leu2, Lys2, Trp1); "Correct (technical context)"; "Platelet mRNA analysis"; "Primers used to determine embryo sex"; "Analysis of mutant phenotypes"; "SNP found in cDNA"; "Detection of meiosis specific genes" (Mei4, Mek1, Sps4, Zip1).]
There is a clear need for information extraction of biological data from full text scientific articles, and the means for doing it are at hand: computers are better suited for fast computation every day, and new Natural Language Processing methodologies can be applied to the biomedical literature (see for example ). Regarding the source of data, full text electronic versions of journals are now more the rule than the exception, with initiatives under way towards the construction of large public repositories of such information (although hotly debated; see the discussion about PubMed Central [15, 16]).
In this work we have shown that the distribution of information (as keywords) in full text articles is heterogeneous and that different article sections correspond to different kinds and densities of relevant data. Abstracts emerge as the best repository from the point of view of packing many keywords into a short space, justifying previous information extraction approaches. The lack of large repositories of full text articles, in contrast to the current eleven million references (many with their abstracts) in the MEDLINE database, is another advantage of the Abstract approach.
However, we have shown that there is much more relevant information (at a ratio of at least 1:4 for gene names, anatomical terms, organism names, etc.) in the rest of the article. We have demonstrated that the information is structured enough to obtain substantial numbers of relevant keywords, but that for certain words (such as gene names) caution must be taken regarding the context of the word.
We propose that the text mining of full text articles should be approached with different strategies for different sections. Beyond the Abstract, the Introduction looks like the best place to look for protein and gene names (and interactions), since it mostly describes current knowledge. The Discussion section, which interprets the results and puts them in the context of current knowledge, looks like the third best place for mining such information, with Methods probably the worst. The Results section could be problematic given its mixed nature between Methods and the rest.
Regarding other subjects, such as keywords about biological concepts (species, tissues, diseases, etc.), the Abstract and then the Introduction again look like the best sections to mine in terms of keyword frequency, but Results and especially Discussion seem better from a quantitative point of view. The Methods section is clearly the appropriate place to look for technical data, measurements, and chemicals. With respect to chemicals, again, their context in this section can be completely different from that in the rest.
The extraction of biological information from full text looks promising, but context must be taken into account. Part of this context is given by the position of the text under analysis within the article. Therefore, tuning the extraction of information to the section is probably a good strategy, and for particular tasks some sections should be avoided.
We have shown that even the simplistic annotation of tagging a fragment of an article as belonging to a characteristic section is already useful for text mining. Further tagging with XML-style markup identifying biological objects and concepts (under development; see for example  or the GENIA project http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/) could ultimately make text mining child's play. We hope for future interfaces for writers of Molecular Biology articles that would do the job upon validation by the authors (for example, marking every occurrence of a gene name with a unique and stable link to one of the existing gene sequence databases). For this to happen, collaboration between scientists and publishers will be very important.
Given a section from an article, we split the text into sentences using a standard part-of-speech tagger (TreeTagger). We only computed associations between the words identified by the tagging as nouns. Following , the association between two words (w_i, w_j) (for example, "cell" and "cycle") can be modeled as the degree of inclusion of one word into the other, which can be defined as the fuzzy binary relation I(w_i, w_j) = n(w_i, w_j) / n(w_i), that is, the ratio of the number of sentences where both words w_i and w_j co-occur, n(w_i, w_j), to the number of sentences where the word w_i occurs, n(w_i). This is an asymmetric relation, very appropriate for modeling hierarchical relations between words as they occur in natural text. For example, in some Cell Biology context, the word "cycle" could appear always related to the word "cell" (as in "cell cycle"), but the word "cell" can be related to many other words, as in "cell growth", "cell membrane", or "cell nucleus". Accordingly, the inclusion value of the word "cycle" into "cell" will be close to one, and the inclusion value of the word "cell" into the word "cycle" will be close to zero.
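The asymmetry of the inclusion relation can be illustrated with a toy corpus in which "cycle" never appears without "cell" but "cell" appears with several words (the sentences are invented for illustration):

```python
from collections import Counter

# Toy sentences, each reduced to its set of nouns.
sentences = [
    {"cell", "cycle"},
    {"cell", "growth"},
    {"cell", "membrane"},
    {"growth", "factor"},
]

occ = Counter()   # n(w): sentences containing word w
co = Counter()    # n(wi, wj): sentences containing both wi and wj
for s in sentences:
    for w in s:
        occ[w] += 1
    for a in s:
        for b in s:
            if a != b:
                co[(a, b)] += 1

def inclusion(wi, wj):
    # Degree of inclusion of wi into wj: co-occurrences over occurrences of wi.
    return co[(wi, wj)] / occ[wi]

print(inclusion("cycle", "cell"))  # 1.0: "cycle" never appears without "cell"
print(inclusion("cell", "cycle"))  # ~0.33: "cell" mostly appears with other words
```

The two directions of the relation differ, as expected for a hierarchical ("cell" as a hub) pattern of word usage.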
We identify a word as relevant for the text analyzed if it establishes many and strong relations to other words (following ). Therefore, in a given section, we define a score for a word w_i equal to the sum of the inclusion values of all other words into w_i, K(w_i) = sum over j != i of I(w_j, w_i), normalized to the maximum value of K found for any word in that section. The keywords of the section are then defined as those words with a K score above a certain threshold.
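A minimal sketch of the keyword selection, under the assumption that K sums the inclusion values of all other words into the candidate word and is normalized by the section maximum. The inclusion table here is hand-made for illustration and is not data from the study:

```python
words = {"cell", "cycle", "growth", "membrane"}

# Assumed inclusion table I[(wi, wj)]: degree of inclusion of wi into wj
# (hypothetical values, chosen so that "cell" acts as a hub word).
I = {
    ("cycle", "cell"): 1.0, ("growth", "cell"): 0.5, ("membrane", "cell"): 1.0,
    ("cell", "cycle"): 0.33, ("cell", "growth"): 0.33, ("cell", "membrane"): 0.33,
}

def raw_k(w):
    # Sum of the inclusions of all other words into w.
    return sum(v for (wi, wj), v in I.items() if wj == w)

raw = {w: raw_k(w) for w in words}
kmax = max(raw.values())
K = {w: raw[w] / kmax for w in words}   # normalized to the section maximum

keywords = {w for w, k in K.items() if k >= 0.3}
print(sorted(keywords))  # only the hub word "cell" passes the threshold
```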
To classify words into categories we used the following procedure. We chose the MeSH (Medical Subject Headings) classification from the National Library of Medicine. All MeSH terms (including official synonyms) composed of a single word were selected, and the stem of each word was computed using TreeTagger. The words present in our corpus of 104 articles were ordered by frequency, and all words occurring more than 200 times were selected. Those matching the selected single-word MeSH headers from six categories (A, B, C, D, E, and G; see the caption of Figure 4 for descriptions) were assigned to those classes. To avoid possible mis-annotations, words matching more than one category were discarded. A manual analysis of the resulting table of associations was carried out to check the assignments and make new ones. A new class not present in MeSH (the X class of "Units, Dimensions, & Parts") was created to include a large number of terms mainly present in the Methods section.
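The core of this procedure can be sketched as a dictionary lookup with a frequency cutoff and the single-category restriction. The MeSH category assignments below are illustrative stand-ins, not the actual MeSH data:

```python
# Toy single-word MeSH headers mapped to assumed category letters
# (A = Anatomy, B = Organisms, etc.; assignments here are illustrative).
mesh_category = {
    "mouse": {"B"},
    "hippocampus": {"A"},
    "learning": {"F", "G"},  # matches more than one category -> discarded
}

# Hypothetical corpus frequencies.
corpus_freq = {"mouse": 350, "hippocampus": 240, "learning": 260, "the": 5000}

classified = {}
for word, freq in corpus_freq.items():
    cats = mesh_category.get(word)
    # Keep words above the frequency cutoff that match exactly one category.
    if freq > 200 and cats and len(cats) == 1:
        classified[word] = next(iter(cats))

print(classified)  # "learning" (ambiguous) and "the" (no MeSH match) are dropped
```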
We are grateful to the developers and maintainers of the different databases used in this work (SWISSPROT, MeSH, Nature Genetics full text repository), to Helmut Schmid (IMS, Stuttgart University) for distributing TreeTagger, to Harindar S. Keer for help with the data management, and to the members of our group at EMBL-Heidelberg for fruitful discussions.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.