The textual characteristics of traditional and Open Access scientific journals are similar
© Verspoor et al; licensee BioMed Central Ltd. 2009
Received: 19 December 2008
Accepted: 15 June 2009
Published: 15 June 2009
Recent years have seen an increased amount of natural language processing (NLP) work on full text biomedical journal publications. Much of this work is done with Open Access journal articles. Such work assumes that Open Access articles are representative of biomedical publications in general and that methods developed for analysis of Open Access full text publications will generalize to the biomedical literature as a whole. If this assumption is wrong, the cost to the community will be large, including not just wasted resources, but also flawed science. This paper examines that assumption.
We collected two sets of documents, one consisting only of Open Access publications and the other consisting only of traditional journal publications. We examined them for differences in surface linguistic structures that have obvious consequences for the ease or difficulty of natural language processing and for differences in semantic content as reflected in lexical items. Regarding surface linguistic structures, we examined the incidence of conjunctions, negation, passives, and pronominal anaphora, and found that the two collections did not differ. We also examined the distribution of sentence lengths and found that both collections were characterized by the same mode. Regarding lexical items, we found that the Kullback-Leibler divergence between the two collections was low, and was lower than the divergence between either collection and a reference corpus. Where small differences did exist, log likelihood analysis showed that they were primarily in the area of formatting and in specific named entities.
We did not find structural or semantic differences between the Open Access and traditional journal collections.
For much of the modern period of biomedical natural language processing (BioNLP) research, work in text mining has focused on abstracts of journal articles. Free and widely available via PubMed/MEDLINE in numbers previously unseen in most statistical text mining work, abstracts enabled a mass of work that has grown remarkably quickly. In recent years, however, there has been both a growing awareness that full text articles are important, and an increasing amount of work using the full text of articles. As early as 2001, Blaschke and Valencia examined recoverability of databased protein-protein interactions from text and concluded that the ability to handle full text would be essential to achieving high-coverage performance. Shah et al. examined the location of biologically relevant words in journal articles and found that although the density of biologically relevant terms is higher in the abstract than in the body of the article, there is much more relevant information in the body of the article than in the abstract. Corney et al. (2004) provided a careful quantification of the costs of failing to work with full text, finding that more than half of the information in molecular biology papers was in the body of the text and not in the abstract.
At the same time, it became clear very early on that full text poses challenges that are different from those of abstracts. For example, Tanabe and Wilbur (2002) found that some sections (particularly Materials and Methods) tend to produce much higher rates of false positives on information extraction tasks than others. Furthermore, the substantial length of full text articles as compared to abstracts means that it is likely more difficult to identify individual entities or events, due to the increased linguistic complexity of the text and the use of longer-distance references. Preprocessing requirements alone can be prohibitively time-costly with full text. Even issues of character encodings and how various journals deal with them (solutions range from inserted GIFs to HTML character entities to Unicode) are sufficient to throw off character-offset-based systems, which are increasingly popular.
These problems notwithstanding, recent years have seen an increased emphasis on working with full text papers (see e.g.  and  for papers that review a substantial amount of work using full text). However, much of this work is done with Open Access journal articles, and with the availability of the PubMed Central Open Access subset of close to 90K biomedical publications (and growing), we expect research on full text to further concentrate on Open Access publications. Such work will assume that the Open Access articles are representative of biomedical publications in general and that methods developed for analysis of Open Access full text publications will generalize to the biomedical literature as a whole. This assumption requires investigation because there may be significant differences in format or content. For instance, the majority of open access journals have to date been exclusively electronic publications, often without formal restrictions on article length (such as the BioMed Central journals), and the lack of strict space constraints could certainly affect the language authors use to present their findings. Furthermore, there is at least a perception that these journals often have quicker turnaround from submission to publication, and that open access publications have higher community impact, both of which could affect the sort of research results that are submitted to open access journals. Similarly, the cost of publication of open access articles may mean that authors tend to submit longer articles combining more research results. The effect of such differences on the textual characteristics of the publications has not, to our knowledge, been previously explored.
If the basic assumption of the representativeness of Open Access publications is wrong, the cost to the community will be large, including not just wasted resources but also flawed science. This paper sets out to examine that assumption. Our null hypothesis is that traditional and Open Access publications are the same; we seek to find differences between them.
Results and Discussion
We developed or assembled four text collections for comparison.
CRAFT is the Colorado Rich Annotation of Full Text corpus. This is a true corpus in the linguistic sense of that word: a static set of documents with associated linguistic and semantic annotations. The document set was assembled from the PubMed Central Open Access subset with input from the Mouse Genome Informatics group at the Jackson Laboratory to ensure biological relevance. It focuses on mouse genomics. The corpus comprises 97 open access articles containing nearly 750K words.
TraJour (Traditional Journals corpus) is a document collection that we assembled from traditional subscription-based journals, with the intent of collecting a set of texts that topically parallels the CRAFT corpus as closely as possible. This parallelism was achieved via shared Gene Ontology annotations (see the Methods section). TraJour consists of 99 articles and almost 600K words.
Reference is a corpus based on the Wall Street Journal corpus. This is a collection of newspaper articles that has been extensively annotated in the course of the Penn Treebank and PropBank projects. We took the raw text version from the Penn Treebank distribution. It contains about 1.1 million words.
BioReference is a document collection which aims to be representative of full text biomedical publications in general, rather than being tailored to mouse genomics. It was constructed from a random subsample of two document collections: the TREC Genomics Corpus, containing full text publications from primarily subscription-based traditional journals, and the PubMed Central Open Access subset, containing exclusively Open Access publications. It is comparable in size to CRAFT and TraJour, at 650K words in 163 articles.
Characteristics that we compared in the corpora
We compared the corpora according to various surface-level characteristics as well as several linguistic phenomena. We performed comparisons of the statistical properties of the vocabularies of the corpora in order to identify important variations of language use among them. The two corpora of primary interest are the two semantically comparable corpora – CRAFT, our open access publication corpus, and TraJour, our traditional journal corpus.
We examined the incidence of a number of morphosyntactic/semantic phenomena in the four sets of documents. We selected them because each is known to have consequences for natural language processing: in particular, all of the morphosyntactic phenomena that we examined make the text mining task more difficult by introducing complexity and variability in the linguistic structures found in the text. The linguistic phenomena that we examined were negation, passivization, conjunction, and pronominal anaphora.
To examine negation, we counted every instance of the words "no", "not", "neither", and "nor", as well as the affix "n't". To examine passivization, we counted instances of the strings "ed by", "en by", and "ound by". This clearly underestimates the number of passives. For example, conjoined passive verbs, as in "eEF2 kinase is phosphorylated and inhibited by SAPK4/p38 delta", will be undercounted. Similarly, intervening adverbials, as in "MAPK is activated primarily by FGF in this context", will cause undercounting, as will bare passives (i.e. those without a subsequent by-phrase indicating the agent). However, the count yields a reasonable approximation of the number of passives, and the undercounting should apply proportionally to all four document sets, so the comparison across corpora probably remains valid, although we would need to do a separate analysis to verify this. To examine conjunction, we counted every instance of "and", "or", and "but not". Finally, to examine pronominal anaphora, we counted every instance of any pronoun. In each case, we normalized the counts by the number of words in the corpus.
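The counting procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the study: the pronoun list, the whitespace tokenization used for normalization, and the function names `count_phenomena` and `incidence` are all hypothetical choices made for the example.

```python
import re

# Surface cues taken from the description above. The pronoun list is an
# assumption: the text says only that "any pronoun" was counted.
PATTERNS = {
    "negation": re.compile(r"\b(?:no|not|neither|nor)\b|n't", re.IGNORECASE),
    "passive": re.compile(r"(?:ed|en|ound) by\b", re.IGNORECASE),
    "conjunction": re.compile(r"\b(?:and|or|but not)\b", re.IGNORECASE),
    "pronoun": re.compile(
        r"\b(?:I|we|you|he|she|it|they|me|us|him|her|them)\b", re.IGNORECASE
    ),
}


def count_phenomena(text):
    """Return the raw number of pattern matches for each phenomenon."""
    return {name: len(p.findall(text)) for name, p in PATTERNS.items()}


def incidence(text):
    """Return counts normalized by the number of whitespace-separated
    tokens (an assumed definition of "word"), expressed as percentages."""
    n_words = len(text.split())
    if n_words == 0:
        raise ValueError("cannot normalize an empty text")
    return {k: 100.0 * v / n_words for k, v in count_phenomena(text).items()}


sample = "MAPK is activated by FGF and it does not bind. They didn't respond."
counts = count_phenomena(sample)

# A conjoined passive contributes a single match, which illustrates the
# undercounting discussed above.
conjoined = count_phenomena("eEF2 kinase is phosphorylated and inhibited by SAPK4")
```

Because the patterns are purely string-based, an intervening adverbial such as "activated primarily by" produces no passive match at all, which is the second source of undercounting noted in the text.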
Table: Incidence of syntactic/semantic phenomena (surviving labels: Avg. sentence count, Avg. document length, Avg. sentence length)
The directions of the differences with the reference corpus are mostly not surprising. Passives are more common in the two semantically matched corpora (0.39% and 0.43%) and in the BioReference (0.48%) than they are in the Reference corpus (0.24%). This accords with the observation that passives are almost caricatural of scientific writing and are quite common in biomedical language.
Conjunctions are more frequent in the scientific corpora than in the reference corpus. As Biber et al. point out in their corpus-based study of the grammar of English, comparison of competing hypotheses is a dominant theme in scientific writing. Comparison is often realized by use of conjunctions and by asserting the competing hypotheses. Thus the results are in line with previous research in this area, although a separate analysis would be required to establish what proportion of the conjunctions link competing hypotheses.
The pattern of incidence of negations is also in line with other contrastive reports of negation in the academic and news registers. The incidence of negatives in the two semantically matched corpora and the BioReference collection was quite similar: 0.46% for CRAFT, 0.43% for TraJour, and 0.45% for BioReference. However, negatives were much more common in the WSJ reference than in the three scientific corpora, at 0.69%. This is thought to be related to the use of other terms to express contrast in academic discourse, such as "although", "however", "nevertheless", and "on the other hand" (81–82).
The preceding measures are all concerned with linguistic (conjunction, passivization, etc.) or structural (sentence length) feature distributions and their implications for processing difficulty. We now turn to measures that are more reflective of the semantic content of the corpora.
To further explore the possibility of important differences between CRAFT and TraJour, we looked at two measures of lexical difference and similarity. The first of these is Kullback-Leibler divergence, or relative entropy, and the second is log likelihood.
Intuitively, as two distributions become more different, the value for KL divergence increases. We assume that a value of 0.005 or below corresponds to near identity of the distributions. We calculated the KL divergence between CRAFT and TraJour and between each of the two and the reference corpora. We ordered words by frequency in the merged vocabulary of the corpora and then calculated the KL divergence for different values of the top n most frequent words, from the 100 most frequent words to the 10,000 most frequent words, comparing the probability distributions for those selected words in the two corpora. We employed Laplace (add-one) smoothing to accommodate words that occurred in one corpus but not in the other.
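The calculation described above can be illustrated with a minimal Python sketch. Several details are assumptions rather than facts from the text: the probabilities are renormalized over the selected top-n words, the natural logarithm is used, and the divergence is computed in one direction only, D(P || Q). The function names and the toy counts are hypothetical.

```python
import math
from collections import Counter


def top_n_vocabulary(counts_p, counts_q, n):
    """Return the n most frequent words in the merged vocabulary."""
    merged = counts_p + counts_q
    return [word for word, _ in merged.most_common(n)]


def kl_divergence(counts_p, counts_q, n):
    """KL divergence D(P || Q) over the top-n merged vocabulary,
    with Laplace (add-one) smoothing applied to both distributions."""
    vocab = top_n_vocabulary(counts_p, counts_q, n)
    # Add one to every selected word's count, so each total grows by len(vocab).
    total_p = sum(counts_p[w] for w in vocab) + len(vocab)
    total_q = sum(counts_q[w] for w in vocab) + len(vocab)
    divergence = 0.0
    for w in vocab:
        p = (counts_p[w] + 1) / total_p
        q = (counts_q[w] + 1) / total_q
        divergence += p * math.log(p / q)
    return divergence


# Toy word counts standing in for two corpora (hypothetical data).
corpus_a = Counter({"gene": 5, "mouse": 3})
corpus_b = Counter({"gene": 1, "stock": 7})

same = kl_divergence(corpus_a, corpus_a, 10)
different = kl_divergence(corpus_a, corpus_b, 10)
```

Identical distributions give a divergence of zero, and the value grows as the two word distributions move apart. Note that KL divergence is not symmetric, so swapping the two arguments generally changes the result.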
Table: KL divergence of term probability distributions, CRAFT versus TraJour (columns: CRAFT v. TraJour, CRAFT v. BioRef., TraJour v. BioRef., CRAFT v. Ref., TraJour v. Ref.)
In contrast, if either corpus is compared against the reference corpus, they are drastically different, with KL divergences for the top 100 words of 0.161 and 0.167, respectively, far above the assumed identity threshold. Even compared with the BioReference corpus, the divergence is well above this threshold (0.044 and 0.021 at 100 words), suggesting that there are significant lexical differences between the mouse genome corpora and general biomedical text, while there do not appear to be lexical differences simply due to the mode of publication of the text.
Log Likelihood analysis of terms in CRAFT vs. TraJour
Log Likelihood analysis of terms in CRAFT vs. BioReference
Log Likelihood analysis of terms in TraJour vs. BioReference
Log Likelihood analysis of terms in CRAFT vs. Reference
Log Likelihood analysis of terms in TraJour vs. Reference
We can analyze this data in terms of two characteristics: the magnitude of the differences, and the semantic nature of the words in terms of which the various pairs of corpora differ.
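For readers unfamiliar with the log likelihood measure, the following Python sketch shows one common formulation for comparing a word's frequency in two corpora, in the style of the frequency-profiling approach of Rayson and Garside. Whether this exact variant was used for the analysis here is an assumption, and the function names and toy counts are hypothetical.

```python
import math


def log_likelihood(a, b, c, d):
    """Log likelihood for one word.

    a, b: observed frequency of the word in corpus 1 and corpus 2.
    c, d: total number of words in corpus 1 and corpus 2.
    """
    # Expected frequencies if the word were equally likely in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2.0 * ll


def rank_terms(counts1, counts2):
    """Return (score, word) pairs sorted from most to least distinctive."""
    c = sum(counts1.values())
    d = sum(counts2.values())
    vocab = set(counts1) | set(counts2)
    scored = [
        (log_likelihood(counts1.get(w, 0), counts2.get(w, 0), c, d), w)
        for w in vocab
    ]
    return sorted(scored, reverse=True)


# Toy counts standing in for two corpora (hypothetical data).
ranking = rank_terms({"mouse": 50, "the": 100}, {"mouse": 5, "the": 100})
```

A word with the same relative frequency in both corpora scores zero, and larger scores indicate words whose frequencies differ more between the two collections. The score alone does not say which corpus favors the word, so the direction has to be read from the raw counts.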
TF*IDF-ranked terms in the corpora
In terms of linguistic phenomena such as conjunction, passivization, negation, and pronominal anaphora, the content-matched Open Access and traditional publications do not differ from each other. They also do not differ in terms of sentence length. When compared against reference corpora, they do differ from these more general document sets, indicating that if the Open Access and traditional journals did differ from each other, our methods would have uncovered those differences.
The two target corpora analyzed (CRAFT and TraJour) are both in the molecular biology domain, and more specifically mouse genomics. As such, the results and conclusions, strictly interpreted, apply only to the particular datasets we examined. Based on the analysis of the factors that might lead to textual variation (see Background), it would be conservative to assume that these results generalize to the molecular biomedical literature as a whole. We believe that generalizing these results to the entire biomedical literature, or even all peer-reviewed scientific publications, is reasonable, although additional testing may be warranted for areas with substantially different cultures of scientific practice.
We tried hard to find differences between the CRAFT and TraJour document sets. We mostly failed. Research on Open Access documents applies to traditional, subscription-only journals.
Construction of the TraJour corpus
Construction of the BioReference corpus
One hundred PubMed identifiers were selected at random from each of two sources: the 2006 TREC Genomics Corpus and the PubMed Central Open Access subset. These two sources were used because they are the only two large collections of full text publications that we have access to. The TREC Genomics Corpus was collected originally for the Genomics Track of the Text Retrieval Conference. The 2006 corpus contains over 162K articles from 49 journals, ranging from the American Journal of Epidemiology to several American Journal of Physiology journals (e.g. Heart and Circulatory Physiology), and as such the corpus has quite broad coverage of biomedicine despite the "Genomics" name. Our selection included 41 articles from The Journal of Biological Chemistry, 12 from Blood, 4 each from Human Reproduction, Human Molecular Genetics, and the Journal of Applied Physiology, and 1–3 each from 20 other journals.
The portion of the BioReference corpus randomly selected from the PubMed Central Open Access subset included publications from Nucleic Acids Research (23 articles), Environmental Health Perspectives (9 articles), Ulster Medical Journal (4 articles), BMC Genomics (4 articles), Medical History (4 articles) and 44 other journals contributing 1 or 2 articles each.
Three of the articles selected for the PubMed Central dataset were missing from that set. After selecting the files and pre-processing them to extract the plain text, two files from the TREC Genomics collection were found to be empty. The corpus thus consists of 195 files containing content, 97 from the PubMed Central Open Access dataset and 98 from the TREC Genomics dataset. We then eliminated any files smaller than 1 KB (1024 bytes), as those did not represent full text files. The remaining 163 files comprise a reference set that can be considered a balanced sample of both full text Open Access and traditional journal publications indexed in PubMed, and that is not oriented toward the mouse genomics topics on which CRAFT and TraJour are focused.
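The two mechanical steps in this construction, random selection of identifiers and removal of files under 1024 bytes, can be sketched as follows. This is a hedged illustration only: the function names, the fixed random seed, and the use of on-disk file size as the length criterion are assumptions made for the example.

```python
import random
from pathlib import Path


def sample_identifiers(identifiers, k=100, seed=0):
    """Draw k distinct identifiers at random (the seed is an assumption,
    included only so that the example is reproducible)."""
    rng = random.Random(seed)
    return rng.sample(list(identifiers), k)


def keep_full_text(paths, min_bytes=1024):
    """Keep only files of at least min_bytes, dropping empty or stub files."""
    return [p for p in paths if Path(p).stat().st_size >= min_bytes]
```

In use, `sample_identifiers` would be called once per source collection, and `keep_full_text` would be applied to the extracted plain-text files from both sources together.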
We have not performed significance testing of the statistical results provided in this paper as we are mostly interested in the qualitative differences that could impact text mining applications, and minor variations will always exist between any particular document corpora. This is a limitation of the approach.
The work of all three authors was supported by grants G08LM009639, R01LM009254, and R01LM008111 to Lawrence Hunter. We gratefully acknowledge the NIH scientific review author who originally suggested that we undertake this project and the reviewers of this paper for their thoughtful comments.
- Verspoor K, Cohen KB, Mani I, Goertzel B: Introduction to BioNLP'06. Linking natural language processing and biology: towards deeper biological literature analysis. Association for Computational Linguistics; 2006, iii-iv. [http://www.aclweb.org/anthology/W/W06/W06-3300.pdf]
- Blaschke C, Valencia A: Can bibliographic pointers for known biological data be found automatically? Protein interactions as a case study. Comparative and Functional Genomics 2001, 2(4):196–206.
- Shah PK, Perez-Iratxeta C, Bork P, Andrade MA: Information extraction from full text scientific articles: Where are the keywords? BMC Bioinformatics 2003, 4(20).
- Corney DP, Buxton BF, Langdon WB, Jones DT: BioRAT: extracting biological information from full-length papers. Bioinformatics 2004, 20(17):3206–3213.
- Tanabe L, Wilbur WJ: Tagging gene and protein names in full text articles. Proceedings of the ACL'02 workshop on Natural language processing in the biomedical domain 2002, 9–13. [http://www.aclweb.org/anthology-new/W/W02/W02-0302.pdf]
- Krallinger M, Leitner F, Rodriguez-Penagos C, Valencia A: Overview of the protein-protein interaction annotation extraction task of BioCreative II. Genome Biology 2008, 9(Suppl 2).
- Hersh W, Voorhees E: TREC genomics special issue overview. Information Retrieval 2008, 12(1):1–15. doi:10.1007/s10791-008-9076-6
- The PubMed Central Open Access subset [http://www.pubmedcentral.nih.gov/about/openftlist.html]
- Swan A, Brown S: Authors and open access publishing. Learned Publishing 2004, 17(3):219–224. [http://www.ingentaconnect.com/content/alpsp/lp/2004/00000017/00000003/art00007]
- Eysenbach G: Citation Advantage of Open Access Articles. PLoS Biol 2006, 4(5):e157.
- Marcus MP, Marcinkiewicz MA, Santorini B: Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 1993, 19(2):313–330. [http://www.aclweb.org/anthology/J/J93/J93-2004.pdf]
- Palmer M, Kingsbury P, Gildea D: The Proposition Bank: an annotated corpus of semantic roles. Computational Linguistics 2005, 31: 71–106. [http://www.aclweb.org/anthology/J/J05/J05-1004.pdf]
- TREC Genomics Track website [http://ir.ohsu.edu/genomics/]
- Knebel A, Morrice N, Cohen P: A novel method to identify protein kinase substrates: eEF2 kinase is phosphorylated and inhibited by SAPK4/p38 delta. The EMBO Journal 2001, 20(16):4360–4369.
- Curran K, Grainger R: Expression of activated MAP kinase in Xenopus laevis embryos: Evaluating the roles of FGF and other signaling pathways in early induction and patterning. Developmental Biology 2000, 228: 41–56.
- Cohen KB, Palmer M, Hunter L: Nominalization and alternations in biomedical language. PLoS ONE 2008, 3(9):e3158.
- Biber D, Johansson S, Leech G, Conrad S, Finegan E: Longman grammar of spoken and written English. Pearson; 1999.
- Kullback S, Leibler RA: On information and sufficiency. Annals of Mathematical Statistics 1951, 22: 79–86.
- Dunning T: Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 1993, 19: 61–74. [http://www.aclweb.org/anthology-new/J/J93/J93-1003.pdf]
- Rayson P, Garside R: Comparing corpora using frequency profiling. Proceedings of the Workshop on Comparing Corpora, held in conjunction with ACL 2000. October 2000, Hong Kong 2000, 1–6. [http://www.aclweb.org/anthology/W/W00/W00-0901.pdf]
- Mouse Genome Informatics Gene Ontology annotation file [http://cvsweb.geneontology.org/cgi-bin/cvsweb.cgi/go/gene-associations/gene_association.mgi.gz?rev=HEAD]
- Ferrucci D, Lally A: Building an example application with the unstructured information management architecture. IBM Systems Journal 2004, 43(3):455–475.
- The Unstructured Information Management Architecture [http://incubator.apache.org/uima]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.