  • Research article
  • Open access

From POS tagging to dependency parsing for biomedical event extraction



Given the importance of relation or event extraction from biomedical research publications to support knowledge capture and synthesis, and the strong dependency of approaches to this information extraction task on syntactic information, it is valuable to understand which approaches to syntactic processing of biomedical text have the highest performance.


We perform an empirical study comparing state-of-the-art traditional feature-based and neural network-based models for two core natural language processing tasks of part-of-speech (POS) tagging and dependency parsing on two benchmark biomedical corpora, GENIA and CRAFT. To the best of our knowledge, there is no recent work making such comparisons in the biomedical context; specifically, no detailed analysis of neural models on this data is available. Experimental results show that, in general, the neural models outperform the feature-based models on both corpora. We also perform a task-oriented evaluation to investigate the influence of these models in a downstream application on biomedical event extraction, and show that better intrinsic parsing performance does not always imply better extrinsic event extraction performance.


We have presented a detailed empirical study comparing traditional feature-based and neural network-based models for POS tagging and dependency parsing in the biomedical context, and also investigated the influence of parser selection for a biomedical event extraction downstream task.

Availability of data and materials

We make the retrained models available at


The biomedical literature, as captured in the parallel repositories of PubMed (abstracts; Footnote 1) and PubMed Central (full text articles; Footnote 2), is growing at a remarkable rate of over one million publications per year. Effort to catalog the key research results in these publications demands automation [1]. Hence extraction of relations and events from the published literature has become a key focus of the biomedical natural language processing community.

Methods for information extraction typically make use of linguistic information, with a specific emphasis on the value of dependency parses. A number of linguistically-annotated resources, notably including the GENIA [2] and CRAFT [3] corpora, have been produced to support development and evaluation of natural language processing (NLP) tools over biomedical publications, based on the observation of the substantive differences between these domain texts and general English texts, as captured in resources such as the Penn Treebank [4] that are standardly used for development and evaluation of syntactic processing tools. Recent work on biomedical relation extraction has highlighted the particular importance of syntactic information [5]. Despite this, that work, and most other related work, has simply adopted a tool to analyze the syntactic characteristics of the biomedical texts without consideration of the appropriateness of the tool for these texts. A commonly used tool is the Stanford CoreNLP dependency parser [6], although domain-adapted parsers (e.g. [7]) are sometimes used.

Prior work on the CRAFT treebank demonstrated substantial variation in the performance of syntactic processing tools for that data [3]. Given the significant improvements in parsing performance in the last few years, thanks to renewed attention to the problem and exploration of neural methods, it is important to revisit whether the commonly used tools remain the best choices for syntactic analysis of biomedical texts. In this paper, we therefore investigate current state-of-the-art (SOTA) approaches to dependency parsing as applied to biomedical texts. We also present detailed results on the precursor task of POS tagging, since parsing depends heavily on POS tags. Finally, we study the impact of parser choice on biomedical event extraction, following the structure of the extrinsic parser evaluation shared task (EPE 2017) for biomedical event extraction [8]. We find that differences in overall intrinsic parser performance do not consistently explain differences in information extraction performance.

Experimental methodology

In this section, we present our empirical approach to evaluate different POS tagging and dependency parsing models on benchmark biomedical corpora. Fig. 1 illustrates our experimental flow. In particular, we compare pre-trained and retrained POS taggers, and investigate the effect of these pre-trained and retrained taggers in pre-trained parsing models (in the first five rows of Table 4). We then compare the performance of retrained parsing models to the pre-trained ones (in the last ten rows of Table 4). Finally, we investigate the influence of pre-trained and retrained parsing models in the biomedical event extraction task (in Table 11).

Fig. 1 Diagram outlining the design of experiments


We use two biomedical corpora: GENIA [2] and CRAFT [3]. GENIA includes abstracts from PubMed, while CRAFT includes full text publications. It has been observed that there are substantial linguistic differences between the abstracts and the corresponding full text publications [9]; hence it is important to consider both contexts when assessing NLP tools in the biomedical domain.

The GENIA corpus contains 18K sentences (486K words) from 1,999 Medline abstracts, which are manually annotated following the Penn Treebank (PTB) bracketing guidelines [2]. On this treebank, we use the training, development and test split from [10] (Footnote 3). We then use the Stanford constituent-to-dependency conversion toolkit (v3.5.1) to generate dependency trees with basic Stanford dependencies [11].

The CRAFT corpus includes 21K sentences (561K words) from 67 full-text biomedical journal articles (Footnote 4). These sentences are syntactically annotated using an extended PTB tag set. Given this extended set, the Stanford conversion toolkit is not suitable for generating dependency trees. Hence, a dependency treebank using the CoNLL 2008 dependencies [12] was produced from the CRAFT treebank using ClearNLP [13]; we directly use this dependency treebank in our experiments. We use sentences from the first 6 files (PubMed IDs: 11532192–12585968) for development and sentences from the next 6 files (PubMed IDs: 12925238–15005800) for testing, while the remaining 55 files are used for training.

Table 1 gives an overview of the experimental datasets, while Table 2 details corpus statistics. We also include out-of-vocabulary (OOV) rate in Table 1. OOV rate is relevant because if a word has not been observed in the training data at all, the tagger/parser is limited to using contextual clues to resolve the label (i.e. it has observed no prior usage of the word during training and hence has no experience with the word to draw on).
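As an illustration of the OOV rate described above, a minimal sketch follows. It assumes a token-level, case-sensitive definition (the share of evaluation tokens whose exact word form never occurs in the training data); the function name `oov_rate` and the toy token lists are hypothetical and are not taken from our experimental code.

```python
def oov_rate(train_tokens, eval_tokens):
    """Percentage of evaluation tokens whose word form is absent from training."""
    vocabulary = set(train_tokens)
    if not eval_tokens:
        return 0.0
    unseen = sum(1 for token in eval_tokens if token not in vocabulary)
    return 100.0 * unseen / len(eval_tokens)


# Toy data: "POU(S)" and "domain" never occur in the training tokens.
train_tokens = ["the", "protein", "binds", "the", "promoter"]
eval_tokens = ["the", "POU(S)", "domain", "binds"]
print(oov_rate(train_tokens, eval_tokens))  # 50.0
```

Other definitions (for example, lowercasing tokens or counting word types rather than tokens) would give different percentages.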

Table 1 The number of files (#file), sentences (#sent), word tokens (#token) and out-of-vocabulary (OOV) percentage in each experimental dataset
Table 2 Statistics by the most frequent dependency and overlapped POS labels, sentence length (i.e. number of words in the sentence) and relative dependency distances i − j from a dependent w_i to its head w_j

POS tagging models

We compare SOTA feature-based and neural network-based models for POS tagging over both GENIA and CRAFT. We consider the following:

  • MarMoT [14] is a well-known generic CRF framework as well as a leading POS and morphological tagger (Footnote 5).

  • NLP4J’s POS tagging model [15] (NLP4J-POS) is a dynamic feature induction model that automatically optimizes feature combinations (Footnote 6). NLP4J is the successor of ClearNLP.

  • BiLSTM-CRF [16] is a sequence labeling model which extends a standard BiLSTM neural network [17, 18] with a CRF layer [19].

  • BiLSTM-CRF+CNN-char extends the model BiLSTM-CRF with character-level word embeddings. For each word token, its character-level word embedding is derived by applying a CNN to the word’s character sequence [20].

  • BiLSTM-CRF+LSTM-char also extends the BiLSTM-CRF model with character-level word embeddings, which are derived by applying a BiLSTM to each word’s character sequence [21].

For the three BiLSTM-CRF-based sequence labeling models, we use a performance-optimized implementation from [22] (Footnote 7). As detailed later in the “POS tagging results” section, we use NLP4J-POS to predict POS tags on development and test sets and perform 20-way jackknifing [23] to generate POS tags on the training set for dependency parsing.

Dependency parsers

Our second study assesses the performance of SOTA dependency parsers, as well as commonly used parsers, on biomedical texts. Prior work on the CRAFT treebank identified the domain-retrained ClearParser [24], now part of the NLP4J toolkit [25], as a top-performing system for dependency parsing over that data. It remains the best performing non-neural model for dependency parsing. In particular, we compare the following parsers:

  • The Stanford neural network dependency parser [6] (Stanford-NNdep) is a greedy transition-based parsing model which concatenates word, POS tag and arc label embeddings into a single vector, and then feeds this vector into a multi-layer perceptron with one hidden layer for transition classification (Footnote 8).

  • NLP4J’s dependency parsing model [26] (NLP4J-dep) is a transition-based parser with a selectional branching method that uses confidence estimates to decide when to employ a beam (Footnote 9).

  • jPTDP v1 [27] is a joint model for POS tagging and dependency parsing (Footnote 10), which uses BiLSTMs to learn feature representations shared between POS tagging and dependency parsing. jPTDP can be viewed as an extension of the graph-based dependency parser bmstparser [28], replacing POS tag embeddings with LSTM-based character-level word embeddings. For jPTDP, we train with gold standard POS tags.

  • The Stanford “Biaffine” parser v1 [29] extends bmstparser with biaffine classifiers to predict dependency arcs and labels, obtaining the highest parsing result to date on the benchmark English PTB. The Stanford Biaffine parser v2 [30] further extends v1 with LSTM-based character-level word embeddings, obtaining the highest result (i.e., 1st place) at the CoNLL 2017 shared task on multilingual dependency parsing [31]. We use the Stanford Biaffine parser v2 in our experiments (Footnote 11).

Implementation details

We use the training set to learn model parameters while we tune the model hyper-parameters on the development set. Then we report final evaluation results on the test set. The metric for POS tagging is accuracy. The metrics for dependency parsing are the labeled attachment score (LAS) and unlabeled attachment score (UAS): LAS is the proportion of words which are assigned both the correct head and the correct dependency label, while UAS is the proportion of words which are assigned the correct head.
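The two parsing metrics can be made concrete with a short sketch. It assumes each sentence is represented as a list of `(head_index, label)` pairs, one per word, with head index 0 for the root; this representation and the function name `attachment_scores` are illustrative choices, not the evaluation script used in our experiments, and punctuation filtering is left out.

```python
def attachment_scores(gold, predicted):
    """Return (LAS, UAS) in percent for aligned lists of (head_index, label)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same words")
    if not gold:
        return 0.0, 0.0
    # UAS counts words with the correct head; LAS also requires the correct label.
    correct_heads = sum(1 for g, p in zip(gold, predicted) if g[0] == p[0])
    correct_arcs = sum(1 for g, p in zip(gold, predicted) if g == p)
    total = len(gold)
    return 100.0 * correct_arcs / total, 100.0 * correct_heads / total


gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "dobj")]
predicted = [(2, "det"), (3, "dobj"), (0, "root"), (2, "dobj")]
las, uas = attachment_scores(gold, predicted)
print(las, uas)  # 50.0 75.0
```

In the toy example, three of four heads are correct (UAS 75%), but one of those three carries the wrong label, so LAS drops to 50%.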

For the three BiLSTM-CRF-based models, Stanford-NNdep, jPTDP and Stanford-Biaffine, which utilize pre-trained word embeddings, we employ 200-dimensional pre-trained word vectors from [32]. These pre-trained vectors were obtained by training the Word2Vec skip-gram model [33] on a PubMed abstract corpus of 3 billion word tokens.

For the traditional feature-based models MarMoT, NLP4J-POS and NLP4J-dep, we use their original pure Java implementations with default hyper-parameter settings.

For the BiLSTM-CRF-based models, we use default hyper-parameters provided in [22] with the following exceptions: for training, we use Nadam [34] and run for 50 epochs. We perform a grid search of hyper-parameters to select the number of BiLSTM layers from {1,2} and the number of LSTM units in each layer from {100, 150, 200, 250, 300}. Early stopping is applied when no performance improvement on the development set is obtained after 10 contiguous epochs.
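The search procedure described above can be sketched as a grid over the two hyper-parameters, with a patience-based early stopping rule inside each run. This is a schematic illustration only: `dev_score_at_epoch` stands in for training one epoch and evaluating on the development set, and `toy_run` is a made-up score curve, so none of these names correspond to the implementation from [22].

```python
from itertools import product


def train_with_early_stopping(dev_score_at_epoch, max_epochs=50, patience=10):
    """Return (best_score, best_epoch); stop after `patience` epochs without gain."""
    best_score, best_epoch = float("-inf"), 0
    for epoch in range(1, max_epochs + 1):
        score = dev_score_at_epoch(epoch)  # hypothetical: train one epoch, then evaluate
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_score, best_epoch


def grid_search(make_run, layer_options=(1, 2),
                unit_options=(100, 150, 200, 250, 300)):
    """Try every (layers, units) pair and keep the best development score."""
    results = {}
    for n_layers, n_units in product(layer_options, unit_options):
        results[(n_layers, n_units)] = train_with_early_stopping(
            make_run(n_layers, n_units))
    best_config = max(results, key=lambda config: results[config][0])
    return best_config, results


def toy_run(n_layers, n_units):
    """Made-up development curve that peaks at epoch 5 (illustration only)."""
    peak = 90.0 + n_layers + n_units / 1000.0
    return lambda epoch: peak - 0.1 * abs(epoch - 5)


best_config, results = grid_search(toy_run)
print(best_config, results[best_config])
```

The selected configuration is then the one whose best development score is highest; the test set is not consulted during this search.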

For Stanford-NNdep, we select the wordCutOff from {1,2} and the size of the hidden layer from {100, 150, 200, 250, 300, 350, 400} and fix other hyper-parameters with their default values.

For jPTDP, we use 50-dimensional character embeddings and fix the initial learning rate at 0.0005. We also fix the number of BiLSTM layers at 2 and select the number of LSTM units in each layer from {100,150,200,250,300}. Other hyper-parameters are set at their default values.

For Stanford-Biaffine, we use default hyper-parameter values [30]. These default values can be considered optimal, as they helped produce the highest scores for 57 test sets (including English test sets) and the second highest scores for 14 test sets, out of 81 test sets in total across 45 different languages at the CoNLL 2017 shared task [31].

Main results

POS tagging results

Table 3 presents POS tagging accuracy of each model on the test set, based on retraining of the POS tagging models on each biomedical corpus. The penultimate row presents the result of the pre-trained Stanford POS tagging model english-bidirectional-distsim.tagger [35], trained on a larger corpus of sections 0–18 (about 38K sentences) of English PTB WSJ text; given the use of newswire training data, it is unsurprising that this model produces lower accuracy than the retrained tagging models. The final row includes published results of the GENIA POS tagger [36], when trained on 90% of the GENIA corpus (cf. our 85% training set; Footnote 12). The GENIA tagger does not support a (re)-training process.

In general, we find that the six retrained models produce competitive results. BiLSTM-CRF and MarMoT obtain the lowest scores on GENIA and CRAFT, respectively. jPTDP obtains a similar score to MarMoT on GENIA and similar score to BiLSTM-CRF on CRAFT. In particular, MarMoT obtains accuracy results at 98.61% and 97.07% on GENIA and CRAFT, which are about 0.2% and 0.4% absolute lower than NLP4J-POS, respectively. NLP4J-POS uses additional features based on Brown clusters [37] and pre-trained word vectors learned from a large external corpus, providing useful extra information.

BiLSTM-CRF obtains accuracies of 98.44% on GENIA and 97.25% on CRAFT. Using character-level word embeddings helps to produce about 0.5% and 0.3% absolute improvements to BiLSTM-CRF on GENIA and CRAFT, respectively, resulting in the highest accuracies on both experimental corpora. Note that for PTB, CNN-based character-level word embeddings [20] only provided a 0.1% improvement to BiLSTM-CRF [16]. The larger improvements on GENIA and CRAFT show that character-level word embeddings are specifically useful to capture rare or unseen words in biomedical text data. Character-level word embeddings are useful for morphologically rich languages [27, 38], and although English is not morphologically rich, the biomedical domain contains a wide variety of morphological variants of domain-specific terminology [39]. Words tagged incorrectly are largely associated with gold tags NN, JJ and NNS; many are abbreviations which are also out-of-vocabulary. It is typically difficult for character-level word embeddings to capture those unseen abbreviated words [40].

On both GENIA and CRAFT, BiLSTM-CRF with character-level word embeddings obtains the highest accuracy scores. These are just 0.1% absolute higher than the accuracies of NLP4J-POS. Note that small variations in POS tagging performance are not a critical factor in parsing performance [41]. In addition, we find that NLP4J-POS is about 30 times faster in both training and testing. Hence for the dependency parsing task, we use NLP4J-POS to perform 20-way jackknifing [23] to generate POS tags on training data and to predict POS tags on development and test sets.
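The jackknifing step can be sketched as follows. This is a minimal illustration rather than the NLP4J-POS pipeline: the round-robin fold assignment, the function names and the most-frequent-tag stand-in tagger are all assumptions made for the example. The point it shows is that every training sentence is tagged by a model that never saw that sentence, so the parser is trained on tags with realistic error rates.

```python
from collections import Counter, defaultdict


def train_most_frequent_tagger(train_sentences, default_tag="NN"):
    """Stand-in tagger (assumption): tags each word with its most frequent tag."""
    counts = defaultdict(Counter)
    for words, tags in train_sentences:
        for word, tag in zip(words, tags):
            counts[word][tag] += 1
    best = {word: c.most_common(1)[0][0] for word, c in counts.items()}
    return lambda words: [best.get(word, default_tag) for word in words]


def jackknife_tags(sentences, train_tagger, n_folds=20):
    """Predict tags for each sentence with a tagger trained on the other folds."""
    predicted = [None] * len(sentences)
    for fold in range(n_folds):
        held_out = [i for i in range(len(sentences)) if i % n_folds == fold]
        if not held_out:
            continue
        rest = [sentences[i] for i in range(len(sentences)) if i % n_folds != fold]
        tagger = train_tagger(rest)  # never sees the held-out sentences
        for i in held_out:
            predicted[i] = tagger(sentences[i][0])
    return predicted


sentences = [
    (["the", "protein", "binds"], ["DT", "NN", "VBZ"]),
    (["the", "factor", "binds"], ["DT", "NN", "VBZ"]),
    (["activated", "cells"], ["JJ", "NNS"]),
]
print(jackknife_tags(sentences, train_most_frequent_tagger, n_folds=3))
```

In the toy run, "activated" and "cells" receive the default tag because they occur only in their own held-out fold, which mirrors how unseen words are handled at test time.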

Overall dependency parsing results

We present the LAS and UAS scores of different parsing models in Table 4. The first five rows show parsing results on the GENIA test set of “pre-trained” parsers. The first two rows present scores of the pre-trained Stanford NNdep and Biaffine v1 models with POS tags predicted by the pre-trained Stanford tagger [35], while rows 3–4 present scores of these pre-trained models with POS tags predicted by NLP4J-POS. Both pre-trained NNdep and Biaffine models were trained on a dependency treebank of 40K sentences, which was converted from the English PTB sections 2–21. The fifth row shows scores of BLLIP+Bio, the BLLIP reranking constituent parser [42] with an improved self-trained biomedical parsing model [10]. We use the Stanford conversion toolkit (v3.5.1) to generate dependency trees with the basic Stanford dependencies and use the data split on GENIA as used in [10]; the parsing scores are therefore comparable. The remaining rows show results of our retrained dependency parsing models.

On GENIA, among pre-trained models, BLLIP+Bio obtains the highest results. This model, unlike the other pre-trained models, was trained using GENIA, so this result is unsurprising. The pre-trained Stanford-Biaffine (v1) model produces lower scores than the pre-trained Stanford-NNdep model on GENIA. This is also unsurprising because the pre-trained Stanford-Biaffine utilizes pre-trained word vectors which were learned from newswire corpora. Note that the pre-trained NNdep and Biaffine models result in no significant performance differences irrespective of the source of POS tags (i.e. the pre-trained Stanford tagger at 98.37% vs. the retrained NLP4J-POS model at 98.80%).

Regarding the retrained parsing models, on both GENIA and CRAFT, Stanford-Biaffine achieves the highest parsing results with LAS at 91.23% and UAS at 92.64% on GENIA, and LAS at 90.77% and UAS at 92.67% on CRAFT, computed without punctuation. Stanford-NNdep obtains the lowest scores; about 3.5% and 5% absolute lower than Stanford-Biaffine on GENIA and CRAFT, respectively. jPTDP is ranked second, obtaining about 1% and 2% lower scores than Stanford-Biaffine and 1.5% and 1% higher scores (without punctuation) than NLP4J-dep on GENIA and CRAFT, respectively. Table 4 also shows that the best parsing model Stanford-Biaffine obtains about 1% absolute improvement when using gold POS tags instead of predicted POS tags.

Parsing result analysis

Here we present a detailed analysis of the parsing results obtained by the retrained models with predicted POS tags. For simplicity, the following more detailed analyses report LAS scores, computed without punctuation. Using UAS scores or computing with punctuation does not reveal any additional information.

Sentence length

Figure 2 presents LAS scores by sentence length in bins of length 10. As expected, all parsers produce better results for shorter sentences on both corpora; longer sentences are likely to have longer dependencies which are typically harder to predict precisely. Scores drop by about 10% for sentences longer than 50 words, relative to short sentences <=10 words. Exceptionally, on GENIA we find lower scores for the shortest sentences than for the sentences from 11 to 20 words. This is probably because abstracts tend not to contain short sentences: (i) as shown in Table 2, the proportion of sentences in the first bin is very low at 3.5% on GENIA (cf. 17.8% on CRAFT), and (ii) sentences in the first bin on GENIA are relatively long, with an average length of 9 words (cf. 5 words in CRAFT).

Fig. 2 LAS scores by sentence length. Scores obtained on GENIA and CRAFT are presented in the left and right figures, respectively

Dependency distance

Figure 3 shows LAS (F1) scores corresponding to the dependency distance i − j, between a dependent w_i and its head w_j, where i and j are consecutive indices of words in a sentence. Short dependencies are often modifiers of nouns such as determiners or adjectives or pronouns modifying their direct neighbors, while longer dependencies typically represent modifiers of the root or the main verb [43]. All parsers obtain higher scores for left dependencies than for right dependencies. This is not completely unexpected as English is strongly head-initial. In addition, the gaps between LSTM-based models (i.e. Stanford-Biaffine and jPTDP) and non-LSTM models (i.e. NLP4J-dep and Stanford-NNdep) are larger for the long dependencies than for the shorter ones, as LSTM architectures can preserve long range information [44].
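One way to compute a per-distance LAS (F1) is sketched below. It assumes that precision is taken over predicted arcs falling in a distance bin and recall over gold arcs in that bin, that distances beyond ±5 are pooled into "<-5" and ">5" bins, and that root attachments (head index 0) are skipped because they have no distance; these choices and all function names are illustrative assumptions rather than a description of our evaluation script.

```python
from collections import Counter


def distance_bin(dependent_index, head_index, limit=5):
    """Bin the signed distance i - j between dependent i and head j."""
    distance = dependent_index - head_index
    if distance < -limit:
        return f"<-{limit}"
    if distance > limit:
        return f">{limit}"
    return str(distance)


def las_f1_by_distance(gold, predicted, limit=5):
    """gold/predicted: lists of (head_index, label); word i sits at list position i-1."""
    gold_count, pred_count, correct = Counter(), Counter(), Counter()
    for i, (g, p) in enumerate(zip(gold, predicted), start=1):
        if g[0] != 0:
            gold_count[distance_bin(i, g[0], limit)] += 1
        if p[0] != 0:
            pred_count[distance_bin(i, p[0], limit)] += 1
        if g == p and g[0] != 0:
            correct[distance_bin(i, g[0], limit)] += 1
    scores = {}
    for key in set(gold_count) | set(pred_count):
        precision = correct[key] / pred_count[key] if pred_count[key] else 0.0
        recall = correct[key] / gold_count[key] if gold_count[key] else 0.0
        total = precision + recall
        scores[key] = 100.0 * 2 * precision * recall / total if total else 0.0
    return scores


gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "dobj")]
predicted = [(2, "det"), (3, "dobj"), (0, "root"), (2, "dobj")]
print(las_f1_by_distance(gold, predicted))
```

With this sign convention, negative bins hold dependents that precede their head (left dependencies) and positive bins hold dependents that follow it.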

Fig. 3 LAS (F1) scores by dependency distance. Scores obtained on GENIA and CRAFT are presented in the left and right figures, respectively

On both corpora, higher scores are also associated with shorter distances. There is one surprising exception: on GENIA, in distance bins of −4, −5 and <−5, Stanford-Biaffine and jPTDP obtain higher scores for longer distances. This may result from the structural characteristics of sentences in the GENIA corpus. Table 5 details the scores of Stanford-Biaffine in terms of the most frequent dependency labels in these left-most dependency bins. We find amod and nn are the two most difficult to predict dependency relations (the same finding applied to jPTDP). They appear much more frequently in the bins −4 and −5 than in bin <−5, explaining the higher overall score for bin <−5.

Table 5 LAS (F1) scores of Stanford-Biaffine on GENIA, by frequent dependency labels in the left dependencies

Dependency label

Tables 6 and 7 present LAS scores for the most frequent dependency relation types on GENIA and CRAFT, respectively. In most cases, Stanford-Biaffine obtains the highest score for each relation type on both corpora with the following exceptions: on GENIA, jPTDP gets the highest results for aux, dep and nn (as well as nsubjpass), while NLP4J-dep and NNdep obtain the highest scores for auxpass and num, respectively. On GENIA the labels associated with the highest average LAS scores (generally >90%) are amod, aux, auxpass, det, dobj, mark, nsubj, nsubjpass, pobj and root whereas on CRAFT they are NMOD, OBJ, PMOD, PRD, ROOT, SBJ, SUB and VC. These labels either correspond to short dependencies (e.g. aux, auxpass and VC), have strong lexical indications (e.g. det, pobj and PMOD), or occur very often (e.g. amod, subj, NMOD and SBJ).

Table 6 LAS by the basic Stanford dependency labels on GENIA
Table 7 LAS by the CoNLL 2008 dependency labels on CRAFT

Those relation types with the lowest LAS scores (generally <70%) are dep on GENIA and DEP, LOC, PRN and TMP on CRAFT; dep/DEP are very general labels while LOC, PRN and TMP are among the least frequent labels. Those types are also associated with the biggest variation in accuracy across parsers (>8%). In addition, the coordination-related labels cc, conj/CONJ and COORD show large variation across parsers. These 9 mentioned relation labels generally correspond to long dependencies. Therefore, it is not surprising that BiLSTM-based models Stanford-Biaffine and jPTDP can produce much higher accuracies on these labels than non-LSTM models NLP4J-dep and NNdep.

The remaining types are either relatively rare labels (e.g. appos, num and AMOD) or more frequent labels but with a varied distribution of dependency distances (e.g. advmod, nn, and ADV).

POS tag of the dependent

Table 8 analyzes the LAS scores by the most frequent POS tags (across two corpora) of the dependent. Stanford-Biaffine achieves the highest scores on all these tags except TO, where the traditional feature-based model NLP4J-dep obtains the highest score (TO is a relatively rare tag in GENIA and is the least frequent tag in CRAFT among the tags listed in Table 8). Among the listed tags, VBG is the least frequent in GENIA and the second least frequent in CRAFT, and is generally associated with longer dependency distances. So, it is reasonable that the lowest scores we obtain on both corpora are accounted for by VBG. The coordinating conjunction tag CC also often corresponds to long dependencies, thus resulting in the biggest ranges across parsers on both GENIA and CRAFT. The results for CC are consistent with the results obtained for the dependency labels cc in Table 6 and COORD in Table 7 because they are coupled to each other.

Table 8 LAS by POS tag of the dependent

On the remaining POS tags, we generally find similar patterns across parsers and corpora, except for IN and VB: parsers produce 8+% higher scores for IN on GENIA than on CRAFT and, conversely, 9+% lower scores for VB on GENIA than on CRAFT. This is because on GENIA, IN is mostly coupled with the dependency label prep at a rate of 90% (thus their corresponding LAS scores in Tables 8 and 6 are consistent), while on CRAFT IN is coupled to a more varied distribution of dependency labels such as ADV with a rate at 20%, LOC at 14%, NMOD at 40% and TMP at 5%. Regarding VB, on CRAFT it is usually associated with a short dependency distance of 1 word (i.e. head and dependent words are next to each other) with a rate at 80%, and with a distance of 2 words at 15%, while on GENIA it is associated with longer dependency distances with a rate at 17% for the distance of 1 word, 31% for the distance of 2 words and 34% for a distance of >5 words. So, parsers obtain much higher scores for VB on CRAFT than on GENIA.

Error analysis

We analyze token-level parsing errors that occur consistently across all parsers (i.e. the intersection set of errors), and find that there are a few common error patterns. The first one is related to incorrect POS tag prediction (8% of the intersected parsing errors on GENIA and 12% on CRAFT are coupled with incorrect predicted POS tags). For example, the word token “domains” is the head of the phrase “both the POU(S) and POU(H) domains” in Table 9. We also have two OOV word tokens “POU(S)” and “POU(H)” which abbreviate “POU-specific” and “POU homeodomain”, respectively. NLP4J-POS (as well as all other POS taggers) produced an incorrect tag of NN rather than adjective (JJ) for “POU(S)”. As “POU(S)” is predicted to be a noun, all parsers make an incorrect prediction that it is the phrasal head, thus also resulting in errors to remaining dependent words in the phrase.
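The intersection set of errors, and the share of those errors that coincide with a wrong POS tag, can be computed along the following lines. The sketch assumes all parsers are scored on the same tokenization, with each token represented as a `(head_index, label)` pair; the parser names, function names and toy data are hypothetical.

```python
def common_errors(gold, predictions):
    """Indices of tokens that every parser gets wrong (head or label)."""
    error_sets = [
        {i for i, (g, p) in enumerate(zip(gold, predicted)) if g != p}
        for predicted in predictions.values()
    ]
    return sorted(set.intersection(*error_sets)) if error_sets else []


def pos_error_share(error_indices, gold_pos, predicted_pos):
    """Percentage of the given errors whose predicted POS tag is also wrong."""
    if not error_indices:
        return 0.0
    wrong = sum(1 for i in error_indices if gold_pos[i] != predicted_pos[i])
    return 100.0 * wrong / len(error_indices)


gold = [(2, "det"), (0, "root"), (2, "dobj")]
predictions = {
    "parser_a": [(2, "det"), (0, "root"), (1, "dobj")],
    "parser_b": [(3, "det"), (0, "root"), (2, "nsubj")],
}
shared = common_errors(gold, predictions)
print(shared)  # [2]
print(pos_error_share(shared, ["DT", "VB", "NN"], ["DT", "VB", "JJ"]))  # 100.0
```

Only the third token is wrong for both toy parsers, so it is the only member of the intersection set.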

Table 9 Error examples

The second error type occurs on noun phrases such as “the Oct-1-responsive octamer sequence ATGCAAAT” (in Table 9) and “the herpes simplex virus Oct-1 coregulator VP16”, commonly referred to as appositive structures, where the second to last noun (i.e. “sequence” and “coregulator”) is considered to be the phrasal head, rather than the last noun. However, such phrases are relatively rare and all parsers predict the last noun as the head.

The third error type is related to the relation labels dep/DEP. We manually re-annotate every case where all parsers predict the same dependency label for a dependency arc and this label disagrees with the gold label dep/DEP (these cases are about 3.5% of the parsing errors intersected across all parsers on GENIA and 0.5% on CRAFT). Based on this manual review, we find that about 80% of these cases appear to be labelled correctly, despite not agreeing with the gold standard. In other words, the gold standard appears to be in error in these cases. This result is not completely unexpected because when converting from constituent treebanks to dependency treebanks, the general dependency label dep/DEP is usually assigned due to limitations in the automatic conversion toolkit.

Parser comparison on event extraction

We present an extrinsic evaluation of the four dependency parsers for the downstream task of biomedical event extraction.

Evaluation setup

Previously, Miwa et al. [45] adopted the BioNLP 2009 shared task on biomedical event extraction [46] to compare the task-oriented performance of six “pre-trained” parsers with 3 different types of dependency representations. However, their evaluation setup requires use of a currently unavailable event extraction system. Fortunately, the extrinsic parser evaluation (EPE 2017) shared task aimed to evaluate different dependency representations by comparing their performance on downstream tasks [47], including a biomedical event extraction task [8]. We thus follow the experimental setup used there, employing the Turku Event Extraction System (TEES, [48]) to assess the impact of parser differences on biomedical relation extraction (Footnote 13).

EPE 2017 uses the BioNLP 2009 shared task dataset [46], which was derived from the GENIA treebank corpus (800, 150 and 260 abstract files used for BioNLP 2009 training, development and test, respectively; Footnote 14). We only need to provide dependency parses of raw texts using the pre-processed tokenized and sentence-segmented data provided by the EPE 2017 shared task. For the Stanford-Biaffine, NLP4J-dep and Stanford-NNdep parsers that require predicted POS tags, we use the retrained NLP4J-POS model to generate POS tags. We then produce parses using retrained dependency parsing models.

TEES is then trained for the BioNLP 2009 Task 1 using the training data, and is evaluated on the development data (gold event annotations are only available to the public for training and development sets). To obtain test set performance, we use an online evaluation system. The online evaluation system for the BioNLP 2009 shared task is currently not available. Therefore, we employ the online evaluation system for the BioNLP 2011 shared task [49] with the “abstracts only” option (Footnote 15). The score is reported using the approximate span & recursive evaluation strategy [46].

Impact of parsing on event extraction

Table 10 presents the intrinsic UAS and LAS (F1) scores on the pre-processed segmented BioNLP 2009 development sentences that contain event interactions (i.e. scores with respect to predicted segmentation). These scores are higher than those presented in Table 4 because most of the BioNLP 2009 dataset is extracted from the GENIA treebank training set. Although gold event annotations in the BioNLP 2009 test set are not available to the public, it is likely that we would obtain similar intrinsic UAS and LAS scores on the pre-processed segmented test sentences containing event interactions.

Table 10 UAS and LAS (F1) scores of re-trained models on the pre-segmented BioNLP-2009 development sentences which contain event interactions
Table 11 Biomedical event extraction results

Table 11 compares parsers with respect to the EPE 2017 biomedical event extraction task [8]. The first row presents the score of the Stanford&Paris team [50], the highest official score obtained on the test set. Their system used the Stanford-Biaffine parser (v2) trained on a dataset combining PTB, Brown corpus, and GENIA treebank data (Footnote 16). The second row presents our score for the pre-trained BLLIP+Bio model; remaining rows show scores using re-trained parsing models.

The results for parsers trained with the GENIA treebank (Rows 1-6, Table 11) are generally higher than for parsers trained on CRAFT. This is logical because the BioNLP 2009 shared task dataset was a subset of the GENIA corpus. However, we find that the differences in intrinsic parsing results as presented in Tables 4 and 10 do not consistently explain the differences in extrinsic biomedical event extraction performance, extending preliminary related observations in prior work [51, 52]. Among the four dependency parsers trained on GENIA, Stanford-Biaffine, jPTDP and NLP4J-dep produce similar event extraction scores on the development set, while on the test set jPTDP and NLP4J-dep obtain the lowest and highest scores, respectively.

Table 11 also summarizes the results with the dependency structures only (i.e. results without dependency relation labels; replacing all predicted dependency labels by “UNK” before training TEES). In most cases, compared to using dependency labels, event extraction scores drop on the development set (except NLP4J-dep trained on CRAFT), while they increase on the test set (except NLP4J-dep trained on GENIA and Stanford-NNdep trained on CRAFT). Without dependency labels, better event extraction scores on the development set correspond to better scores on the test set. In addition, the differences in these event extraction scores without dependency labels are more consistent with the parsing performance differences than the scores with dependency labels.
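The label-removal step can be illustrated with a small sketch. It assumes parses stored in a tab-separated CoNLL-style format with the dependency label in the eighth column; this is an assumption for illustration only, since the EPE 2017 interchange format differs, and the function name `strip_labels` is hypothetical.

```python
def strip_labels(conll_text, label_column=7, placeholder="UNK"):
    """Replace every dependency label with a placeholder, keeping the tree structure."""
    output = []
    for line in conll_text.splitlines():
        # Keep blank sentence separators and comment lines unchanged.
        if not line.strip() or line.startswith("#"):
            output.append(line)
            continue
        columns = line.split("\t")
        if len(columns) > label_column:
            columns[label_column] = placeholder
        output.append("\t".join(columns))
    return "\n".join(output)


example = (
    "1\tIL-2\t_\tNN\tNN\t_\t2\tnn\t_\t_\n"
    "2\texpression\t_\tNN\tNN\t_\t0\troot\t_\t_"
)
print(strip_labels(example))
```

Head indices are left untouched, so the downstream system still sees the same unlabeled tree and only loses the relation types.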

These findings show that variations in dependency representations strongly affect event extraction performance. Some (predicted) dependency labels are likely to be particularly useful for extracting events, while others hurt performance. However, investigating 20 frequent dependency labels in each dataset, as well as possible combinations of them, would require an enormous number of additional experiments. We believe that the interaction between those labels in a downstream application task deserves a separate, more detailed study. One contribution of our paper is therefore to highlight the need for further research in this direction.


We have presented a detailed empirical study comparing SOTA traditional feature-based and neural network-based models for POS tagging and dependency parsing in the biomedical context. In general, the neural models outperform the feature-based models on two benchmark biomedical corpora, GENIA and CRAFT. In particular, BiLSTM-CRF-based models with character-level word embeddings produce the highest POS tagging accuracies, which are slightly better than those of NLP4J-POS, while the Stanford-Biaffine parsing model obtains significantly better results than the other parsing models.

We also investigated the influence of parser selection for a biomedical event extraction downstream task, and showed that better intrinsic parsing performance does not always imply better extrinsic event extraction performance. Whether this pattern holds for other information extraction tasks is left as future work.













  12. Trained on the PTB sections 0–18, the accuracies for the GENIA tagger, Stanford tagger, MarMoT, NLP4J-POS, BiLSTM-CRF and BiLSTM-CRF+CNN-char on the benchmark test set of PTB sections 22–24 were reported at 97.05%, 97.23%, 97.28%, 97.64%, 97.45% and 97.55%, respectively.

    Table 3 POS tagging accuracies on the test set with gold tokenization
    Table 4 Parsing results on the test set with predicted POS tags and gold tokenization (except [\(\mathcal {G}\)] which denotes results when employing gold POS tags in both training and testing phases)

  14. 678 of 800 training, 132 of 150 development and 248 of 260 test files are included in the GENIA treebank training set.


  16. The EPE 2017 shared task [47] focused on evaluating different dependency representations in downstream tasks, not on comparing different parsers. Therefore each participating team employed only one parser, either a dependency graph parser or a dependency tree parser. Only the Stanford&Paris team [50] employed GENIA data, obtaining the highest biomedical event extraction score.



BiLSTM: Bidirectional LSTM

CNN: Convolutional neural network

CRF: Conditional random field

EPE: Extrinsic parser evaluation

LAS: Labeled attachment score

LSTM: Long short-term memory

NLP: Natural language processing

PTB: Penn treebank

UAS: Unlabeled attachment score

WSJ: Wall street journal


  1. Baumgartner W, Cohen K, Fox L, Acquaah-Mensah G, Hunter L. Manual curation is not sufficient for annotation of genomic databases. Bioinformatics. 2007; 23(13):41–8.


  2. Tateisi Y, Yakushiji A, Ohta T, Tsujii J. Syntax Annotation for the GENIA Corpus. In: Proceedings of the Second International Joint Conference on Natural Language Processing: Companion Volume: 2005. p. 220–5.

  3. Verspoor K, Cohen KB, Lanfranchi A, Warner C, Johnson HL, Roeder C, Choi JD, Funk C, Malenkiy Y, Eckert M, Xue N, Baumgartner WA, Bada M, Palmer M, Hunter LE. A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools. BMC Bioinformatics. 2012; 13(1):207.


  4. Marcus MP, Santorini B, Marcinkiewicz MA. Building a Large Annotated Corpus of English: The Penn Treebank. Comput Linguis. 1993; 19(2):313–30.


  5. Peng N, Poon H, Quirk C, Toutanova K, Yih W-t. Cross-Sentence N-ary Relation Extraction with Graph LSTMs. Trans Assoc Comput Linguis. 2017; 5:101–15.


  6. Chen D, Manning C. A Fast and Accurate Dependency Parser using Neural Networks. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing: 2014. p. 740–50.

  7. McClosky D, Charniak E. Self-training for biomedical parsing. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers: 2008. p. 101–4.

  8. Björne J, Ginter F, Salakoski T. EPE 2017: The Biomedical Event Extraction Downstream Application. In: Proceedings of the 2017 Shared Task on Extrinsic Parser Evaluation: 2017. p. 17–24.

  9. Cohen KB, Johnson H, Verspoor K, Roeder C, Hunter L. The structural and content aspects of abstracts versus bodies of full text journal articles are different. BMC Bioinformatics. 2010; 11(1):492.


  10. McClosky D. Any Domain Parsing: Automatic Domain Adaptation for Natural Language Parsing. 2010. PhD thesis, Department of Computer Science, Brown University.

  11. de Marneffe M-C, Manning CD. The Stanford Typed Dependencies Representation. In: Proceedings of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation: 2008. p. 1–8.

  12. Surdeanu M, Johansson R, Meyers A, Màrquez L, Nivre J. The CoNLL 2008 Shared Task on Joint Parsing of Syntactic and Semantic Dependencies. In: Proceedings of the Twelfth Conference on Computational Natural Language Learning: 2008. p. 159–77.

  13. Choi JD, Palmer M. Guidelines for the CLEAR Style Constituent to Dependency Conversion. 2012. Technical report, Institute of Cognitive Science, University of Colorado Boulder.

  14. Mueller T, Schmid H, Schütze H. Efficient Higher-Order CRFs for Morphological Tagging. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing: 2013. p. 322–32.

  15. Choi JD. Dynamic Feature Induction: The Last Gist to the State-of-the-Art. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: 2016. p. 271–81.

  16. Huang Z, Xu W, Yu K. Bidirectional LSTM-CRF models for sequence tagging. 2015. arXiv:1508.01991.

  17. Schuster M, Paliwal KK. Bidirectional recurrent neural networks. IEEE Trans Signal Proc. 1997; 45(11):2673–81.


  18. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997; 9(8):1735–80.


  19. Lafferty JD, McCallum A, Pereira FCN. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In: Proceedings of the Eighteenth International Conference on Machine Learning: 2001. p. 282–9.

  20. Ma X, Hovy E. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers): 2016. p. 1064–74.

  21. Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C. Neural Architectures for Named Entity Recognition. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: 2016. p. 260–70.

  22. Reimers N, Gurevych I. Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: 2017. p. 338–48.

  23. Koo T, Carreras X, Collins M. Simple Semi-supervised Dependency Parsing. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: 2008. p. 595–603.

  24. Choi JD, Palmer M. Getting the most out of transition-based dependency parsing. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: 2011. p. 687–92.

  25. Choi JD, Tetreault J, Stent A. It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers): 2015. p. 387–96.

  26. Choi JD, McCallum A. Transition-based Dependency Parsing with Selectional Branching. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers): 2013. p. 1052–62.

  27. Nguyen DQ, Dras M, Johnson M. A Novel Neural Network Model for Joint POS Tagging and Graph-based Dependency Parsing. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies: 2017. p. 134–42.

  28. Kiperwasser E, Goldberg Y. Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations. Trans Assoc Comput Linguist. 2016; 4:313–27.


  29. Dozat T, Manning CD. Deep Biaffine Attention for Neural Dependency Parsing. In: Proceedings of the 5th International Conference on Learning Representations: 2017.

  30. Dozat T, Qi P, Manning CD. Stanford’s Graph-based Neural Dependency Parser at the CoNLL 2017 Shared Task. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies: 2017. p. 20–30.

  31. Zeman D, Popel M, Straka M, Hajic J, Nivre J, Ginter F, Luotolahti J, Pyysalo S, Petrov S, Potthast M, Tyers F, Badmaeva E, Gokirmak M, Nedoluzhko A, Cinkova S, Hajic jr J, Hlavacova J, Kettnerová V, Uresova Z, Kanerva J, Ojala S, Missilä A, Manning CD, Schuster S, Reddy S, Taji D, Habash N, Leung H, de Marneffe M-C, Sanguinetti M, Simi M, Kanayama H, de Paiva V, Droganova K, Martínez Alonso H, Çöltekin Ç, Sulubacak U, Uszkoreit H, Macketanz V, Burchardt A, Harris K, Marheinecke K, Rehm G, Kayadelen T, Attia M, Elkahky A, Yu Z, Pitler E, Lertpradit S, Mandl M, Kirchner J, Alcalde HF, Strnadová J, Banerjee E, Manurung R, Stella A, Shimada A, Kwak S, Mendonca G, Lando T, Nitisaroj R, Li J. CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies: 2017. p. 1–19.

  32. Chiu B, Crichton G, Korhonen A, Pyysalo S. How to Train good Word Embeddings for Biomedical NLP. In: Proceedings of the 15th Workshop on Biomedical Natural Language Processing: 2016. p. 166–74.

  33. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed Representations of Words and Phrases and their Compositionality. In: Advances in Neural Information Processing Systems 26: 2013. p. 3111–9.

  34. Dozat T. Incorporating Nesterov Momentum into Adam. In: Proceedings of the ICLR 2016 Workshop Track: 2016.

  35. Toutanova K, Klein D, Manning CD, Singer Y. Feature-rich Part-of-speech Tagging with a Cyclic Dependency Network. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1: 2003. p. 173–80.

  36. Tsuruoka Y, Tateishi Y, Kim J-D, Ohta T, McNaught J, Ananiadou S, Tsujii J. Developing a robust part-of-speech tagger for biomedical text. In: Advances in Informatics: 2005. p. 382–92.

  37. Brown PF, deSouza PV, Mercer RL, Pietra VJD, Lai JC. Class-based N-gram Models of Natural Language. Comput Linguist. 1992; 18(4):467–79.


  38. Plank B, Søgaard A, Goldberg Y. Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers): 2016. p. 412–8.

  39. Liu H, Christiansen T, Baumgartner WA, Verspoor K. BioLemmatizer: a lemmatization tool for morphological processing of biomedical text. J Biomed Semant. 2012; 3(1):3.


  40. Habibi M, Weber L, Neves M, Wiegandt DL, Leser U. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics. 2017; 33(14):37–48.


  41. Seddah D, Chrupała G, Cetinoglu O, van Genabith J, Candito M. Lemmatization and lexicalized statistical parsing of morphologically-rich languages: the case of french. In: Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages: 2010. p. 85–93.

  42. Charniak E, Johnson M. Coarse-to-fine n-best parsing and maxent discriminative reranking. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics: 2005. p. 173–80.

  43. McDonald R, Nivre J. Characterizing the Errors of Data-Driven Dependency Parsing Models. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning: 2007. p. 122–31.

  44. Graves A. Supervised sequence labelling with recurrent neural networks. 2008. PhD thesis, Technical University Munich.

  45. Miwa M, Pyysalo S, Hara T, Tsujii J. Evaluating Dependency Representations for Event Extraction. In: Proceedings of the 23rd International Conference on Computational Linguistics: 2010. p. 779–87.

  46. Kim J-D, Ohta T, Pyysalo S, Kano Y, Tsujii J. Overview of BioNLP’09 Shared Task on Event Extraction. In: Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task: 2009. p. 1–9.

  47. Oepen S, Ovrelid L, Björne J, Johansson R, Lapponi E, Ginter F, Velldal E. The 2017 Shared Task on Extrinsic Parser Evaluation Towards a Reusable Community Infrastructure. In: Proceedings of the 2017 Shared Task on Extrinsic Parser Evaluation: 2017. p. 1–16.

  48. Björne J, Heimonen J, Ginter F, Airola A, Pahikkala T, Salakoski T. Extracting complex biological events with rich graph-based feature sets. In: Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task: 2009. p. 10–8.

  49. Kim J-D, Pyysalo S, Ohta T, Bossy R, Nguyen N, Tsujii J. Overview of BioNLP Shared Task 2011. In: Proceedings of BioNLP Shared Task 2011 Workshop: 2011. p. 1–6.

  50. Schuster S, Clergerie EDL, Candito M, Sagot B, Manning CD, Seddah D. Paris and Stanford at EPE 2017: Downstream Evaluation of Graph-based Dependency Representations. In: Proceedings of the 2017 Shared Task on Extrinsic Parser Evaluation: 2017. p. 47–59.

  51. Nguyen DQ, Verspoor K. An improved neural network model for joint POS tagging and dependency parsing. In: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies: 2018. p. 81–91.

  52. MacKinlay A, Martinez D, Jimeno Yepes A, Liu H, Wilbur WJ, Verspoor K. Extracting biomedical events and modifications using subgraph matching with noisy training data. In: Proceedings of the BioNLP Shared Task 2013 Workshop: 2013. p. 35–44.



This research was also supported by use of the Nectar Research Cloud, a collaborative Australian research platform supported by the National Collaborative Research Infrastructure Strategy (NCRIS).


This work was supported by the ARC Discovery Project DP150101550 and ARC Linkage Project LP160101469.

Availability of data and materials

We make the retrained models available at

Author information

Authors and Affiliations



DQN designed and conducted all the experiments, and drafted the manuscript. KV contributed to the manuscript and provided valuable comments on the design of the experiments. Both authors have read and approved this manuscript.

Corresponding author

Correspondence to Dat Quoc Nguyen.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.


About this article


Cite this article

Nguyen, D., Verspoor, K. From POS tagging to dependency parsing for biomedical event extraction. BMC Bioinformatics 20, 72 (2019).

