Parallel sequence tagging for concept recognition

Background Named Entity Recognition (NER) and Normalisation (NEN) are core components of any text-mining system for biomedical texts. In a traditional concept-recognition pipeline, these tasks are combined in a serial way, which is inherently prone to error propagation from NER to NEN. We propose a parallel architecture, where both NER and NEN are modeled as a sequence-labeling task, operating directly on the source text. We examine different harmonisation strategies for merging the predictions of the two classifiers into a single output sequence. Results We test our approach on the recent Version 4 of the CRAFT corpus. In all 20 annotation sets of the concept-annotation task, our system outperforms the pipeline system reported as a baseline in the CRAFT shared task, a competition of the BioNLP Open Shared Tasks 2019. We further refine the systems from the shared task by optimising the harmonisation strategy separately for each annotation set. Conclusions Our analysis shows that the strengths of the two classifiers can be combined in a fruitful way. However, prediction harmonisation requires individual calibration on a development set for each annotation set. This allows achieving a good trade-off between established knowledge (training set) and novel information (unseen concepts). Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04511-y.

(NEN), linking, or grounding. Typically, the two steps are performed in a sequential manner, using a sequence classifier for NER and a ranking-or rule-based module for NEN. While this approach allows focusing on different methods for the individual steps, it suffers from error propagation, an inherent drawback of any pipeline architecture. For example, a certain NEN system might have excellent accuracy when using ground-truth spans as input, but its performance will decrease when operating on the imperfect output of a span tagger. In particular, a normaliser might be inclined or even forced to predict a concept ID for spurious spans, and it cannot recover from cases where a span is missing.
In this work, we investigate an alternative architecture for concept recognition, which alleviates the problem of error propagation: parallel sequence tagging for NER and NEN. In this architecture, NEN is modeled as a sequence-classification problem (like NER) and applied to the input text independently of the span tagger. The predictions of the two taggers are harmonised using different strategies, the choice of which is a hyperparameter of the complete system. We test our approach with a manually annotated dataset for biomedical concepts, the CRAFT corpus, continuing the efforts from our participation in the CRAFT shared task 2019.

Related Work
Concept recognition has often been approached as a pipeline of NER+NEN. For NER, sequence labeling with conditional random fields (CRF) has dominated the field to present, be it pure CRF as in Gimli [1] or DTMiner [2], on top of a recurrent neural network as in HUNER [3], Saber [4], or DTranNER [5], or even as the head of a BERTbased system as in SciBERT [6]. BERN [7] performs NER by fine-tuning BioBERT alone, even though [8] report improved results when stacking CRF atop BioBERT. Different approaches have been taken to NEN, where extracted mentions are mapped to a vocabulary: exact match as in Neji [9], expert-written rules [10], learning-to-rank as in DNorm [11], linking through an ontology using word embeddings and syntactic re-ranking [12], or sequence-to-sequence prediction [13].
Knowledge-based concept-recognition systems like Jensen tagger [14] or NOBLE coder [15] do not allow for a clear separation between NER and NEN, as span detection and linking happens at once, even if machine-learning components are added for improving accuracy, like for OGER++ [16] or RysannMD [17]. Joint approaches like TaggerOne [18], JLink [19], and others [20,21], however, have separate modules for NER and NEN, which are trained simultaneously. The multi-task sequence labeling architecture for NER and NEN in [21] has been highly inspirational for the present work, although we were unable to reproduce their results, even using the code that the authors made publicly available.

CRAFT corpus and shared task
The Colorado Richly Annotated Full-Text (CRAFT) corpus [22,23] is a collection of 97 scientific articles from the biomedical domain. It is manually annotated for syntactic structure, coreferences, and bio-concepts (entities), the last of which are used in the present study. In the latest release (Version 4), the concept annotations are divided into 10 sets of different entity types, which are provided in two versions each (proper and The extended annotations are referred to by appending EXT to the abbreviations for the proper annotations (CHEBI_EXT, CL_EXT etc.).
The CRAFT corpus has been used in a range of studies. Through repeated improvements and extensions over time, the corpus has become a high-quality resource with rich annotations, but it also led to the situation that most experiments are not directly comparable to each other, as their setup differs in many ways. In the first release of the CRAFT corpus, only 67 articles were available. The remaining 30 documents were not released until the evaluation period of the CRAFT shared task 2019 [32], where they served as a test set. This competition was part of the BioNLP Open Shared Tasks and comprised three core NLP tasks, where participating systems were evaluated against the ground-truth annotations of Version 4 of the CRAFT corpus. However, most prior work on concept recognition was carried out with an older version of CRAFT, i. e. using a different test set, possibly an earlier stage of annotations and a different evaluation method, which means that results are not directly comparable.
While the majority of studies is concerned with concept recognition (i. e. systems that predict IDs), some are restricted to NER, e. g. [4,33,34]. Methodologically, the approaches range from pure dictionary-based [15,35] to entirely example-based systems [36], even though the NEN step almost always includes dictionary lookup. Since no official test set was available prior to Version 4, many experiments use an arbitrary train/ test split [37] or apply evaluation to the entire corpus [9]. The metrics used are consistently precision, recall and F-score, but differences exist with respect to considering partial matches. Also, many studies do not cover the full set of annotations, but rather focus on a small selection of entity types, such as Gene Ontology [38] or gene mentions [33]. Furrer et al. BMC Bioinformatics (2021) 22:623

Methods
We propose a paradigm for biomedical concept recognition where named entity recognition (NER) and normalisation (NEN) are tackled in parallel. In a traditional NER+NEN pipeline, the NEN module is restricted to predict concept labels (IDs) for the spans identified by the NER tagger. In order to avoid the error propagation inherent to this serial approach, we drop this restriction and provide the full input sequence to the normaliser. As such, we cast the normalisation task as a sequence-tagging problem -very much like an NER tagger, but with a considerably larger tag set, consisting of all concept IDs of the training data.

Design implications
Modeling concept normalisation as sequence tagging has a number of drawbacks. As discussed in the next section, the CoNLL representation of the data enforces exactly one label for each token, which disallows learning and predicting annotations with overlapping and discontinuous spans. This representation also entails that the model has to produce a consistent series of individual predictions in order to correctly label a multi-word expression. This often means that highly ambiguous tokens like prepositions, numbers, or single letters must be interpreted correctly in context (e. g. "of " in "inhibitor of calpain", "I" in "hexokinase I"). As the most serious limitation, a sequence tagger can only ever predict labels it has seen during training, which restricts the label set of the trained system to a fraction of the target label set (the ontology) in many cases. Since many concepts occur extremely rarely in the biomedical literature (cf. Fig. 1), this limitation might not critically reduce performance measured on a typical evaluation data set. However, it is highly undesirable to have a tagger that is completely incapable of predicting labels beyond the training set. On the other hand, the ID-tagging architecture is technically an end-to-end conceptrecognition system, i. e. it does not depend on any span predictions, which means that the NER step could potentially be skipped entirely. However, due to the small number of tags, span tagging is far more robust with respect to ambiguous tokens and unseen concepts. By adding span predictions, we might thus be able to overcome the limitations of direct ID tagging. Therefore, we chose to combine the strengths of span and ID tagging by applying both in parallel and merging the results in postprocessing.

Data preparation
Our system processes documents in a variant of the CoNLL format, i. e. a verticalised format where each text token is assigned exactly one label. Based on our architecture with two sequence classifiers, we employed two different label sets. For the span tagger, the text is tagged with IOBES labels, i. e. each token is assigned one of the five labels I, O, B, E, or S. Entities spanning only a single token are annotated with S. For multi-word entities, the first and last token are tagged with B and E, respectively, and any intervening tokens with I. The rest of the text (i. e. all tokens outside of an entity) are annotated with O. For the ID tagger, all tokens of an entity are tagged with the respective concept ID. We added a NIL label to mark non-entity tokens, analogously to the O tag of the span tagger.
This representation does not have the same expressiveness as the stand-off format used in CRAFT, which offers great flexibility for anchoring annotations in the text. In particular, the CRAFT corpus contains discontinuous annotations (multiple non-adjacent text spans for the same annotation), overlapping annotations (words shared by multiple annotations) and sub-word spans (annotation refers to part of a word). Since these complex annotations cannot be represented with token-level labels, their structure needs to be simplified.
In order to measure the performance impact of this simplification, we converted the reference annotations of the training set to CoNLL format and back to stand-off using the standoff2conll suite [39]. This utility offers two strategies for unifying discontinuous annotations (full-span and last-span), to which we added a third option (first-span) [40]. For unnesting overlapping annotations, two strategies are available as well (keeplonger and keep-shorter). The effect of unifying and unnesting annotations is illustrated in Fig. 2. Sub-word annotations are extended to span entire tokens.
After this round-trip conversion, the annotations are run through the official evaluation suite provided by CRAFT [41]. Table 1 shows the results for different combinations of unification and unnesting strategies on the non-extended annotation sets. These numbers mark the upper limit for a system trained on input data in CoNLL format. For all annotation sets, using the first-span and keep-longer strategies achieved the highest F-score.

Architecture
The sequence taggers used in our experiments are built atop a pretrained languagerepresentation model, BioBERT [42], which in turn extends BERT [43]. BERT is an attention-based multi-layer neural network which learns context-dependent word-vector representations. It creates bidirectional contextual representations of a token from unlabeled text conditioned on the left and the right context. BERT is trained to solve two tasks: first, to predict whether two sentences follow each other, and second, to predict a randomly masked token from its context. After a slight modification to its architecture, training of BERT can be continued on a different task like NER; this process is referred to as fine-tuning with a task-specific head.
For our experiments, we downloaded BioBERT v1.1, which includes code, configuration and pretrained parameters. BioBERT is based on BERT BASE , which was pretrained for 1M steps by Devlin et al. [43] on a 3.3B-word corpus from the general domain  To perform NER and NEN in parallel, we used two different tag sets for fine-tuning, as described in the previous section: IOBES labels for the span tagger and the set of all concept IDs for the ID tagger. In addition to that, both taggers used a small set of tags inherited from the original BERT implementation, which flag tokens with a special function, such as padding, sub-word unit and sentence boundary. We trained a pair of span and ID tagger for each annotation set, which resulted in a total of 40 individual models.
The predictions of the span tagger are always aligned with the IDs produced by a dictionary-based concept-recognition system, OGER [16,44]. OGER detects mentions of ontology terms in running text through efficient fuzzy-matching. We manually optimised OGER's configuration on the CRAFT training set. We used no additional terminology resources besides the ontologies provided with the corpus. However, we manually added a handful of synonyms for GO_MF. This combined system resembles a classical NER+NEN pipeline, where the high-recall output of the dictionary-based system is combined with the context-aware span detection using an example based classification model.

Hyperparameter tuning
In order to determine the best hyperparameters for each annotation set, we performed extensive grid search in cross-validation over the training set. In particular, we investigated the following configurations: ontology pretraining: enable/disable abbreviation expansion: enable/disable prediction harmonisation: 6 strategies If ontology pretraining is enabled, the ID classifier is trained on synonym-ID pairs from the terminology for 20 epochs before switching to the actual training corpus. For abbreviation expansion, we first used Ab3P [45] to detect abbreviation definitions, then replaced occurrences of short forms with the corresponding long form. For harmonising the predictions of the two classifiers, we compared six different strategies; these are described in the next section.
From previous experiments [46], we knew that ontology pretraining has a positive effect for some, but a negative effect for other annotation sets. We therefore concluded that hyperparameters had to be tuned individually for each of the 20 annotation sets. In order to obtain reliable figures, we performed 6-fold cross-validation with up to 3 runs for each combination.
As we expected, ontology pretraining yielded a mixed picture. In many cases, a clear decision was not possible, as repeated runs gave contradictory results. Unexpectedly, abbreviation expansion showed a clear improvement only for CL and a slight improvement for GO_MF; in all other cases (including CL_EXT and GO_MF_EXT) the results decreased. We decided to disable both ontology pretraining and abbreviation expansion, as the isolated merits do not justify the added complexity.
For prediction harmonisation, the best strategy for each annotation set is given in Table 2 and discussed in the following section. The full results for the whole tuning phase are included in Additional file 1.

Harmonising predictions
The predictions of the span and ID classifier are not guaranteed to agree, even if trained jointly. Disagreement occurs if the span classifier predicts a relevant tag (B, I, E, S) for a particular token while the ID classifier predicts NIL, or, conversely, if the ID classifier predicts a specific concept for a token tagged as irrelevant (O) by the span classifier. In addition, the dictionary feature of the knowledge-based entity recogniser might or might not agree with the neural predictions. This results in 2 × 2 × 2 = 8 prediction patterns concerning the relevance of a given token.
We considered four different strategies for harmonising conflicting predictions: spansonly, ids-only, spans-first, and ids-first (cf. Fig. 3). These strategies are heuristics with a predetermined bias towards one of the two classifiers. Two additional strategies (mutual and override), which use the confidence scores for balancing the classifiers, consistently  produced worse results compared to the simpler bias strategies. The score-based strategies are thus not discussed here; however, we used and described the mutual strategy when participating in the CRAFT shared task [46, p. 188]. The systematic application of different harmonisation strategies is one of the major differences of this work compared to the work presented at the shared-task workshop.
With the spans-only strategy, the ID predictions are completely ignored. In order to provide a concept label, the span predictions are combined with the dictionary feature provided by OGER; in case of multiple features, an arbitrary decision is taken (lexically lowest ID). Since a concept label is always required, span predictions without a supporting feature have to be dropped.
With the ids-only strategy, the predictions are based primarily on the ID predictions, whereas the span predictions are overridden (e. g. the span tag cannot be O when the ID classifier predicts a non-NIL concept). The dictionary feature is ignored in the decision.
The spans-first and ids-first strategies are combinations of the previous two. With the former, the spans-only strategy is applied first, backing off to the ids-only strategy if the outcome is O-NIL. Analogously, the ids-first strategy gives preference to ids-only. An example with partially disagreeing predictions is given in Figure 4.
We compared the effect of the different strategies in a 6-fold cross-validation over the training set. For each annotation set, we determined the best harmonisation strategy based on F-score according to the official evaluation suite. As shown in Table 2, using both span and ID predictions was beneficiary most of the time. In many cases, the same strategy worked best for the proper and extended classes. Intuitively, the choice of spans-only for proteins makes sense, as PR[_EXT] shows an exceedingly high number of different concepts with a small overlap between training and test data, which is a tough scenario for the ID tagger. Conversely, entity types with a limited number of distinct concepts in the corpus like sequences and organisms rely more heavily on the ID tagger. The choice of harmonisation strategy was fixed as a hyperparameter for the testset predictions. Fig. 4 Predictions for PR on a short phrase, harmonised with the ids-first strategy. Using the spans-only or spans-first strategy would yield the same result in this example, since the ID and span predictions are identical for "Hexokinase I"

Results and discussion
We evaluated our concept-recognition system using the official evaluation suite [41]. Performance is measured in terms of F-score, i. e. the harmonic mean of precision and recall, and slot error rate (SER) [47]. Both metrics are based on the counts of matches (true positives), substitutions (partial errors), insertions (false positives), and deletions (false negatives). Partially correct predictions are assigned a similarity score m in the range [0, 1], which measures the accurateness of the predicted spans and concept labels [48]. The similarity score incorporates a notion of textual overlap (Jaccard index at the character level) and a weighted measure of shared ancestors in the ontology hierarchy, as introduced in [49]. The fractional value m is added to the match count, whereas the remainder 1 − m is counted as a substitution. While precision, recall, and F-score are figures of merit ranging from 0 (worst) to 1 (best), SER is a measure of error that assigns 0 to a perfect system and higher values to lower performance. Even though the values for SER and F-score often correlate, they are not guaranteed to produce identical rankings. In particular, SER is more sensitive to false-positive errors than F-score, and low precision has a stronger impact on SER than low recall. Please note that perfect scores cannot be reached by our systems due to limitations in the input representation, as explained in the Data preparation section.
The results for our parallel NER+NEN system are given in Table 3. The scores are compared to our systems developed for the shared task [46] and to the official baseline published in the workshop overview [32]. Our system consistently achieves better scores than the baseline, which is a pipeline with a CRF-based span tagger and a BiLSTM-based concept classifier that were also trained on the CRAFT corpus alone. For most annotation sets, our current system performed better than the best system presented in the shared-task paper, with the exception of GO_MF_EXT and PR_EXT. For NCBITaxon_ EXT and PR, the comparison is inconclusive, as SER and F-score give contradictory rankings.
Unfortunately comparison with other systems is difficult due to the fact that the complete CRAFT corpus was not available before the shared task. Previous published results on the CRAFT corpus (such as [50]) are based on a different (and smaller) version of the corpus.

Effect of harmonisation
In order to measure the effect of the different harmonisation strategies, we evaluated all four strategies on the test set, as shown in Fig. 5. This study also serves as a validation for our hyperparameter-tuning approach, i. e. whether cross-validation on the training set can be used for reliably picking the best-suited harmonisation strategy. For the majority of the annotation sets, the picked strategy also worked best for the test set. Where the picked strategy was not the best (GO_MF_EXT, MOP[_EXT]), the difference to the topperforming strategy was comparatively small.

Unseen concepts
As stated above, a major limitation of trained sequence labeling for IDs is the inability to predict concepts not seen among the training examples. An important goal of combining the ID tagger with a span tagger and dictionary-based predictions is to overcome this limitation. To study the effect of the different harmonisation strategies on unseen concepts, we performed another evaluation on a subset of the annotations. To this end, we filtered both ground truth and predictions of the test set to contain only annotations with concept labels that are not used in the training set. Table 4 shows precision and recall scores as well as annotation counts for the subset of unseen concepts. The ids-only strategy is omitted in the table, as this configuration can never predict unseen concepts. The spans-only and spans-first strategies systematically yield identical results, as they only differ in cases where the latter backs off to ID predictions, which have been filtered out in this evaluation. With the ids-first strategy, many span predictions for unseen concepts are shadowed by an ID prediction for a concept known from the training set (which is then ignored in this specific evaluation). For some annotation types (e. g. CHEBI[_EXT], GO_BP[_EXT], SO[_EXT]), the removal of known concepts improves precision, i. e. more false positives than true positives were removed. In other cases, precision suffers from the removal. Recall decreases in all cases, as is to be expected for an evaluation that focuses on more difficult examples.

Interpretation
Tackling concept recognition for multiple entity types with a single architecture is very challenging, even if a separate model is trained for every annotation set. The comparative results for the different harmonisation strategies ( Figure 5) illustrate well how some annotation sets profit more from the span tagger (blue, left-most bars), others more from the ID tagger (red, right-most bars). In many cases, merging predictions from the two taggers (middle bars) yields better results than relying on a single tagger (outer bars). This preference does not directly correlate with ontology size: the two annotation sets with the largest ontologies (NCBITaxon and PR) show quite distinct result patterns. However, it is possible to empirically determine how well each harmonisation strategy suits the characteristics of a given annotation set. Using cross-validation over the training set resulted in robust estimations for ranking the harmonisation strategies.
The diversity of the individual annotation sets shows even more clearly when it comes to predicting unseen concepts. In general, the level of precision and recall for unseen concepts varies greatly across annotation sets, as does the number of unseen concepts in the reference (cf. Table 4). There is a loose negative correlation to the performance on the entire test set: annotation sets like NCBITaxon[_EXT] and SO[_ EXT] show high overall scores and low scores for unseen concepts, whereas more Table 3 Results for our current BioBERT system, best system reported in the shared-task paper [46], and the official baseline In case of the shared-task systems, the results were selected independently for SER and F-score, i. e. the two scores for a given annotation set do not necessarily come from the same system. For the baseline and the current BioBERT system, however, only one system was evaluated for each annotation set A possible explanation is that the former annotation sets have little variability and a high overlap between training and test set, leading to a strong bias for known concepts (overfitting tendency), which is beneficiary for the test set as a whole, but not for the subset of unseen concepts. The latter annotation sets show great variability of concept labels and surface names in the training data, which makes the task harder but also leads to better generalisation, as the classifier cannot achieve good performance by only learning a few frequent concepts.

Error analysis
We performed an analysis of prediction errors in order to find potential weaknesses or systematic mistakes. As expected, many errors are false negatives due to missing training examples. There are several cases where spelled-out mentions are matched, whereas their abbreviated versions are missed. For example, "olfactory tubercle" is correctly linked thanks to the dictionary-based predictions, while the ad-hoc acronym "OT" is missed. False positive predictions are also frequently seen among abbreviations, which have an increased likelihood of being ambiguous. For example, the short-hand "NF" denotes either "neurofilament" or "nuclear factor" in the training set, which cannot always be correctly distinguished by the classifier.

Table 4 Precision and recall for unseen concepts in the test set
For each annotation set, the number of annotations (ref. count) in the test set is given, counting both occurrences (occ.) and unique labels (unique). A dash for precision and recall means that the corresponding system did not predict any unseen concept at all (neither true nor false positive) At first sight, it seems like abbreviation expansion should be able to alleviate errors like these. Replacing short forms with their corresponding long forms increases chances for a dictionary match and, since it is performed within document scope, potentially reduces ambiguity. However, abbreviation expansion is not guaranteed to work perfectly and can be a source of confusion even if it does. For example, "OT" was correctly expanded to "olfactory tubercle". Unfortunately, this misguided the classifier to label the term as olfactory bulb, as the first token was only used for this concept in the training data. In our experiments, the net effect of abbreviation expansion was negative, as stated above in the Hyperparameter tuning section.
Sometimes, spurious predictions are caused by a substring shared with a training example. Since the WordPiece tokeniser used in (Bio)BERT cuts unknown words into sub-word segments, the classifier sometimes associates a concept label with the fraction of a word, which might trigger false positives in unexpected contexts. As an extreme example, mentions of "PDGFR", "PFK", "PKD", "PI3K", and "PFKD" are erroneously linked to phosphoglycerate kinase (abbreviated "PGK"). This is most likely due to the shared initial letter, as the terms do not refer to semantically similar concepts (even though PFK and PI3K are also kinases). Similarly, "forkhead" is linked to fork, "polymorphonuclear" is linked to nucleus and "prosensory" is linked to forebrain (after the synonym "prosencephalon" seen in training data).
In some cases, the chosen harmonisation strategy prefers an erroneous label over a correct one. For example, the term "monkey" is linked to mouse by the ID tagger due to context (training: "mouse kidney", test: "monkey kidney"). Since the NCBITaxon systems are harmonised with the ids-first strategy, this erroneous prediction overrides the correct annotation from the dictionary-based tagger. Conversely, the dictionary predictions for "insulin" always link to PR:000009054, a specific protein. In the ground truth, however, the more general concept PR:000045358 is used throughout the corpus, which denotes a family of proteins. Even though the ID tagger produces correct labels, the spans-first strategy used for PR gives precedence to the dictionary predictions in these cases.
Another interesting category of errors are the ones that were amended through the system improvements, i. e. spurious and missing annotations from the shared-task system that are correctly predicted by the current system. A frequent case are short spans by the shared-task system, such as "Ephrin" instead of "Ephrin-B1" for PR or "X" instead of "X-Gal" for CHEBI, which are now correctly recognised. Another re-occurring pattern are incorrect IDs, such as "benzodiazepine" linked to CHEBI:16150 (benzoate) rather than CHEBI:22720 (correct ID by the current system). Furthermore, coverage of frequent terms has improved, for example the shared-task system found "Staphylococcus Aureus" in some context but missed it in others which were correctly identified by the current system.

Conclusions
In this work, we present a concept-recognition architecture for parallel NER and NEN. Compared to a sequential NER+NEN pipeline, our approach avoids error propagation from the span-detection to the normalisation step. Modeling NEN as a sequence-labeling task allows it to operate directly on running text, at the cost of restricting the label set of the normaliser to the concepts from the training set. We counter these limitations by fusing its predictions with the output of a span detector and a knowledge-based concept recogniser.
In the CRAFT shared task and in the current study, we have shown that parallel concept recognition can outperform a pipeline system created specifically for the CRAFT corpus. Merging the predictions of a span and an ID tagger is a fruitful way of combining the complementary strengths of both of them. However, the specifics of interpolating between span and ID predictions is subject to further research. We took an empiric approach to pick the best harmonisation strategy for each annotation set.
For future work, we intend to test our approach on other datasets. Even though the CRAFT corpus allows validating systems on a broad range of entity types, there is only little opportunity for direct comparison to competing approaches at the time of writing -to the best of our knowledge, there are no published results for the latest version (Version 4) of CRAFT besides the shared task.