Analyzing transfer learning impact in biomedical cross-lingual named entity recognition and normalization

Rivera-Zavala, Renzo M.; Martínez, Paloma

doi:10.1186/s12859-021-04247-9

Volume 22 Supplement 1

Recent Progresses with BioNLP Open Shared Tasks - Part 2

Research
Open access
Published: 17 December 2021

Analyzing transfer learning impact in biomedical cross-lingual named entity recognition and normalization

BMC Bioinformatics volume 22, Article number: 601 (2021) Cite this article

2226 Accesses
2 Citations
7 Altmetric
Metrics details

Abstract

Background

The volume of biomedical literature and clinical data is growing at an exponential rate. Therefore, efficient access to data described in unstructured biomedical texts is a crucial task for the biomedical industry and research. Named Entity Recognition (NER) is the first step for information and knowledge acquisition when we deal with unstructured texts. Recent NER approaches use contextualized word representations as input for a downstream classification task. However, distributed word vectors (embeddings) are very limited in Spanish and even more for the biomedical domain.

Methods

In this work, we develop several biomedical Spanish word representations, and we introduce two Deep Learning approaches for pharmaceutical, chemical, and other biomedical entities recognition in Spanish clinical case texts and biomedical texts, one based on a Bi-STM-CRF model and the other on a BERT-based architecture.

Results

Several Spanish biomedical embeddigns together with the two deep learning models were evaluated on the PharmaCoNER and CORD-19 datasets. The PharmaCoNER dataset is composed of a set of Spanish clinical cases annotated with drugs, chemical compounds and pharmacological substances; our extended Bi-LSTM-CRF model obtains an F-score of 85.24% on entity identification and classification and the BERT model obtains an F-score of 88.80% . For the entity normalization task, the extended Bi-LSTM-CRF model achieves an F-score of 72.85% and the BERT model achieves 79.97%. The CORD-19 dataset consists of scholarly articles written in English annotated with biomedical concepts such as disorder, species, chemical or drugs, gene and protein, enzyme and anatomy. Bi-LSTM-CRF model and BERT model obtain an F-measure of 78.23% and 78.86% on entity identification and classification, respectively on the CORD-19 dataset.

Conclusion

These results prove that deep learning models with in-domain knowledge learned from large-scale datasets highly improve named entity recognition performance. Moreover, contextualized representations help to understand complexities and ambiguity inherent to biomedical texts. Embeddings based on word, concepts, senses, etc. other than those for English are required to improve NER tasks in other languages.

Background

Efficient information extraction off biomedical data described in scientific articles, clinical narrative, or e-health reports is a growing interest in biomedical industry, research, and so forth. In this context, improved biomedical name mentions identification in the biomedical texts is a crucial step downstream tasks such as drug and protein interactions, chemical compounds, adverse drug reactions, among others. Named Entity Recognition (NER) is one of the fundamental tasks of biomedical text processing, intending to automatically extract and identify mentions of entities of interest in running text, typically through their mention boundary or by classifying tokens to match specific entity mentions. Traditionally, there are three phases in recognizing concepts in texts: (1) to identify the limits of the term or phrase that represents the concept in the text (char offsets in the text), (2) to classify the term or phrase on a class (for instance, drug, disease, body part, etc.) and (3) to normalize the concept by assigning it an identifier in a specific domain resource such as UMLS [1]. The existing biomedical NER methods can be classified into: dictionary-based methods, which are based on the use of existing domain knowledge dictionaries limited by its size, spelling errors, the use of synonyms, and the constant growth of vocabulary. Rule-based methods and Machine Learning methods usually depend on the engineering of syntactic and semantic features as well as specific language and domain features that are learned from large collections of text or built from scratch. More recently, deep learning approaches have emerged due to the availability of myriad data from different sources (scientific literature, social media, clinical texts, etc.).

The NER task has been accomplished by three types of methods. Dictionary-based methods require having specific resources integrating terminology such METAMAP tool [2] that includes UMLS [1] and recognizes mentions of medical concepts. With the availability of annotated corpora, machine learning supervised approaches have widely used in entity recognition. One of the most effective methods is Conditional Random Fields (CRF) [3] since CRF is one of the most reliable sequence labeling methods. Different challenges have been held to foster research in NER, for example, eHealth CLEF, SEMEVAL and TAC, among others. In the special case of drugs, DDIExtraction 2011 [4] and DDIExtraction 2013 [5] were specifically designed to recognize pharmacological entities and drug-drug interactions (DDI) in Medline abstracts and DrugBank technical records both in English. In these shared tasks the best result reported for NER using four types of pharmacological substances (generic drug names, branded drug names, drug group names and active substances not approved for human use) was F1 of 71.5% (by a system based on CRF algorithm). For DDI identification and classification in four classes (advice, mechanism, effect and int) the best result was 65.1% (system bases on a combination of kernels). Most of the participating systems were built on support vector machines (SVM). In general, approaches based on non-linear kernels methods achieved better results than linear SVMs.

More recently, deep learning methods started to obtain better results in NER based on the use of pre-trained models (word embeddings) obtained from a huge volume of unlabelled texts (scientific literature, social media texts, Wikipedia, among others). Word embeddings have been evolving from static representations that do not model the dynamic nature of words to contextualized representations that allow word embeddings to adapt to the context it appears, see [6] for a detailed description of embeddings. Pre-trained models may be useful for analyzing texts if these texts are similar to what they were trained on. When texts are from a different domain we will need to fine-tune a pre-trained model to fit our data or task. This is much more efficient than training a whole model from scratch because it is too time and resources consuming task. With a limited set of examples systems can get high performance in downstream tasks. See [7] for a survey of embeddings in clinical natural language processing. The new challenge, PharmacoNER 2019 [8] was focused on recognizing and normalizing pharmacological substances in Spanish clinical cases. In the current stream of deep learning approaches, participating systems mostly included those architectures. The baseline defined for PharmacoNER was based on vocabulary transfer using a LSTM model with Glove embeddings trained from SBWC and the medical word embeddings for Spanish [9] that achieved a high F1 of 0.82% (ranked 16 out of 22) in NER. The first ranked system [10] was based on a pipeline composed of a BERT (Bidirectional Encoder Representations from Transformers) for NER and a Bi-LSTM for concept indexing achieving an F1-score of 91.5% on NER and 83.9% on concept indexing. The third-ranked system [11] was based on a Bi-LSTM-CRF tagger with FLAIR contextualized embeddings obtaining a result of 89.76% F1-score using pre-trained embeddings and up to 90.5% using specialized ones. The second-ranked system [12] implemented a traditional knowledge-based approach based on dictionaries, particularly the SNOMED-CT medical ontology [13] together with a set of 104 contextual regexp patterns to tackle ambiguity (an important issue especially for abbreviations) and surprisingly this system obtained an F1-Score of 91% in NER and 91.6% in concept indexing (top system). This reveals that resource-based approaches have a lot to say yet. Other deep learning works have also demonstrated state-of-the-art performance for English [14,15,16] texts by automatically learning relevant patterns from corpora, which allows language and domain independence. Weber [17] described a set of experiments with a NER tool called HUNER that incorporates a fully trained LSTM-CRF model using 34 different corpora for five entity types that outperform the state-of-the-art tools CnormPlus and tmChem by 5–13 pp for chemicals, species and gene on CRAFT corpus [18]. However, concerning the generation of domain-based pre-trained models until now, to the best of our knowledge, there is only one work that addresses the generation of Spanish biomedical word embeddings [9, 19].

In this paper, we propose two deep learning approaches to face the recognition of pharmacological and chemical entities in Spanish texts. The approaches are evaluated using the Spanish biomedical PharmaCoNER and English biomedical CORD-19 datasets. Our main goal is to evaluate the performance impact of cross-domain (general and biomedical domain) and cross-language (Spanish and English) pre-trained embeddings models. Firstly, for entity identification and classification, we implemented two bidirectional Long Short Memory (Bi-LSTM) layers with a CRF layer based on the NeuroNER model proposed in [20]. Specifically, we have extended the NeuroNER [20] architecture by adding context information to token-level representation, such as Part-of-Speech (PoS) tags and overlapping or nested entities. Moreover, in this work, we use several pre-trained word embedding models: (i) a word2vec model (Spanish Billion Word Embeddings [21]), which was trained on the 2014 dump of Wikipedia, (ii) pre-trained word2vec model of word embeddings trained with PubMed and PMC articles, (iii) Scielo and Wikipedia cased pre-trained model based on the FastText implementation, (iv) a sense-disambiguation embedding model [22], where different word senses are represented with different sense vectors and trained from scratch embedding models (v) the FastText-SBC model trained on the FastText implementation and (vi) the SNOMED-SBC model based on the FastText-SBC replacing concepts with their unique SNOMED-CT [13] identifier. Finally, we implemented the Bidirectional Encoder Representations for Transformers (BERT) model with fine-tuning using BERT pre-trained general domain models and a trained from scratch biomedical model. For concept indexing based on the output of offset recognition and entity classification, we applied a full-text search and a fuzzy matching approach on the SNOMED-CT Spanish Edition dictionary to obtain the corresponding index to normalize the concept.

Results

We evaluate our deep learning models using the train, validation and test datasets provided by the task organizers of the PharmaCoNER Shared Task [8]. The PharmaCoNER task considers two subtasks. Subtask 1 considers offset recognition and entity classification of pharmacological substances, compounds, and proteins. Subtask 2 considers concept indexing where for each entity, the list of unique SNOMED concept identifiers must be generated. We apply the standard measures precision, recall and F1-score to evaluate the performance of our approaches. These metrics are also used in the PharmaCoNER task. A detailed description of the evaluation can be found in [23].

Moreover, we evaluate our deep learning models on the train, validation and test subsets of the CORD-19 dataset [24]. F-measure is used as the primary metric where true positives are entities that match with the gold standard annotations boundaries and entity type.

Offset detection and entity classification

The NER task is addressed as a sequence labeling task. For NER we tested different configurations with various pre-trained word representation models.

Bi-LSTM CRF model: extended NeuroNER

For our Bi-LSTM CRF model we test various pre-trained and trained from the scratch word embeddings models (see Table 21). Table 1 describes our different experiment configurations for the PharmaCoNER datasets with Spanish general domain (W2V-SBWC and FastText-SBWC), English general domain (FastText 2M), Spanish biomedical domain (FastText-SBC and SNOMED-SBC) and English biomedical domain (PubdMed and PMC) embeddings. Each configuration for all evaluations was executed up to 5 times and we kept the best result obtained (85.75) as shown in Table 2. Table 2 compares the different results obtained in 5 runs for Extended NeuroNER using FastText-SBC + Reddit embedding models.

Table 1 System hyperparameters for each PharmaCoNER run

Recent Progresses with BioNLP Open Shared Tasks - Part 2

Analyzing transfer learning impact in biomedical cross-lingual named entity recognition and normalization

Abstract

Background

Methods

Results

Conclusion

Background

Results

Offset detection and entity classification

Bi-LSTM CRF model: extended NeuroNER

Multi-layer bidirectional transformer encoder: BERT

Concept indexing

Discussion

Conclusions

Methods

Named entity recognition

Corpora

Bi-LSTM CRF model: extended NeuroNER

Multi-layer bidirectional transformer encoder: BERT

Datasets

Availability of data and materials

Abbreviations

References

Acknowledgements

About This Supplement

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Consent to publish

Competing interests

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us