Benchmarking for biomedical natural language processing tasks with a domain specific ALBERT

Background: The abundance of biomedical text data coupled with advances in natural language processing (NLP) is resulting in novel biomedical NLP (BioNLP) applications. These NLP applications, or tasks, rely on the availability of domain-specific language models (LMs) trained on massive amounts of data. Most existing domain-specific LMs adopted the bidirectional encoder representations from transformers (BERT) architecture, which has limitations, and their generalizability is unproven, as there is an absence of baseline results among common BioNLP tasks. Results: We present 8 variants of BioALBERT, a domain-specific adaptation of A Lite Bidirectional Encoder Representations from Transformers (ALBERT), trained on biomedical (PubMed and PubMed Central) and clinical (MIMIC-III) corpora and fine-tuned for 6 different tasks across 20 benchmark datasets. Experiments show that a large variant of BioALBERT trained on PubMed outperforms the state of the art on named-entity recognition (+11.09% BLURB score improvement), relation extraction (+0.80% BLURB score), sentence similarity (+1.05% BLURB score), document classification (+0.62% F1-score), and question answering (+2.83% BLURB score). It represents a new state of the art in 5 out of 6 benchmark BioNLP tasks. Conclusions: The large variant of BioALBERT trained on PubMed achieved a higher BLURB score than previous state-of-the-art models on 5 of the 6 benchmark BioNLP tasks. Depending on the task, 5 different variants of BioALBERT outperformed previous state-of-the-art models on 17 of the 20 benchmark datasets, showing that our model is robust and generalizable across common BioNLP tasks. We have made BioALBERT freely available, which will help the BioNLP community avoid the computational cost of training and establish a new set of baselines for future efforts across a broad range of BioNLP tasks.


Background & Summary
The growing volume of published biomedical literature, such as clinical reports 1 and health literacy 2 , demands more precise and generalizable biomedical natural language processing (BioNLP) tools for information extraction. Recent advances in applying deep learning (DL) to natural language processing (NLP) have fueled the development of pre-trained language models (LMs) that can be applied to a range of tasks in the BioNLP domain 3 .
However, directly fine-tuning state-of-the-art (SOTA) pre-trained LMs for BioNLP tasks, such as Embeddings from Language Models (ELMo) 4 , Bidirectional Encoder Representations from Transformers (BERT) 5 and A Lite Bidirectional Encoder Representations from Transformers (ALBERT) 6 , yields poor performance because these LMs were trained on general-domain corpora (e.g. Wikipedia, BookCorpus) and were not designed for the requirements of biomedical documents, which have different word distributions and complex entity relationships 7 . To overcome this limitation, BioNLP researchers have trained LMs on biomedical and clinical corpora and demonstrated their effectiveness on various downstream BioNLP tasks [8][9][10][11][12][13][14][15] .
Jin et al. 9 trained biomedical ELMo (BioELMo) on PubMed abstracts and found that the features extracted by BioELMo contained entity-type and relational information relevant to the biomedical corpus. Beltagy et al. 11 trained BERT on scientific texts and published the trained model as Scientific BERT (SciBERT). Similarly, Si et al. 10 used task-specific models and enhanced traditional non-contextual and contextual word-embedding methods for biomedical named-entity recognition (NER) by training BERT on clinical-notes corpora. Peng et al. 12 presented the BLUE (Biomedical Language Understanding Evaluation) benchmark, comprising 5 tasks over 10 datasets for evaluating biomedical LMs; they also showed that BERT models pre-trained on PubMed abstracts and clinical notes outperformed models trained on general corpora. The most popular biomedical pre-trained LM is BioBERT (BERT for Biomedical Text Mining) 13 , which was trained on the PubMed and PubMed Central (PMC) corpora and fine-tuned on 3 BioNLP tasks: NER, relation extraction (RE) and question answering (QA). Gu et al. 14 developed PubMedBERT by training from scratch on PubMed articles and showed performance gains over models trained on general corpora; they also developed a domain-specific vocabulary from PubMed articles and demonstrated a boost in performance on domain-specific tasks. Another biomedical pre-trained LM is KeBioLM 15 , which leverages knowledge from the UMLS (Unified Medical Language System) bases and was applied to 2 BioNLP tasks. Table 1 summarises the datasets previously used to evaluate pre-trained LMs on various BioNLP tasks. Our previous preliminary work showed the potential of a customised domain-specific LM to outperform SOTA in NER tasks 16 .
All these pre-trained LMs adopt the BERT architecture, whose training is slow and requires huge computational resources. Furthermore, these LMs are trained on limited domain-specific corpora, whereas some tasks contain both clinical and biomedical terms, so training with broader coverage of domain-specific corpora can improve performance. ALBERT has been shown to be superior to BERT in NLP tasks 6 , and we suggest that, as with BERT, it can be trained to improve BioNLP tasks. In this study, we hypothesize that training ALBERT on biomedical (PubMed and PMC) and clinical-notes (MIMIC-III) corpora can be more effective and computationally efficient for BioNLP tasks than other SOTA methods. We present biomedical ALBERT (BioALBERT), a new LM designed and optimized to benchmark performance on a range of BioNLP tasks. BioALBERT is based on ALBERT and trained on a large corpus of biomedical and clinical texts. We fine-tuned and compared the performance of BioALBERT on 6 BioNLP tasks with 20 biomedical and clinical benchmark datasets of different sizes and complexity. Compared with most existing BioNLP LMs, which focus on a limited set of tasks, BioALBERT achieved SOTA performance on 5 out of 6 BioNLP tasks and on 17 out of 20 tested datasets. BioALBERT achieved higher performance in NER, RE, sentence similarity and document classification, and a higher accuracy (lenient) score in QA, than the current SOTA LMs. To facilitate developments in the important BioNLP community, we make the pre-trained BioALBERT LMs and the source code for fine-tuning BioALBERT publicly available 1 .

Methods
BioALBERT has the same architecture as ALBERT. An overview of the pre-training, fine-tuning, and the variants of tasks and datasets used for BioNLP is shown in Figure 1. We first describe ALBERT and then the pre-training and fine-tuning process employed in BioALBERT.

ALBERT
ALBERT 6 builds on the architecture of BERT to mitigate BERT's large number of parameters, which causes model degradation, memory issues and long pre-training times. ALBERT is a contextualized LM based on a masked language model (MLM) and pre-trained using bidirectional transformers 33 , like BERT. The MLM objective predicts randomly masked words in a sequence and can be used to learn bidirectional representations.
ALBERT is trained on the same English Wikipedia and BooksCorpus as BERT; however, it reduces BERT's parameters by 87 percent and can be trained nearly twice as fast. ALBERT achieves this reduction by factorizing the embedding parameters, decomposing the large vocabulary-embedding matrix into two smaller matrices, and by sharing parameters across layers.
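As a back-of-the-envelope illustration of the factorized embedding parameterization described above, the sketch below (our own; only the 30K vocabulary size is taken from this work, while the hidden and embedding sizes are illustrative ALBERT-style assumptions) compares the parameter count of a tied V×H embedding with the V×E + E×H factorization:

```python
# Sketch of ALBERT's factorized embedding parameterization.
# V = vocabulary size (30K, as used for BioALBERT); H = hidden size;
# E = embedding size. The H and E values below are illustrative assumptions.

def embedding_params_bert(V, H):
    # BERT ties the embedding size to the hidden size: one V x H matrix.
    return V * H

def embedding_params_albert(V, E, H):
    # ALBERT factorizes the embedding into V x E and E x H matrices.
    return V * E + E * H

V, H, E = 30_000, 768, 128
bert_p = embedding_params_bert(V, H)          # 23,040,000 parameters
albert_p = embedding_params_albert(V, E, H)   # 3,938,304 parameters
print(f"reduction: {1 - albert_p / bert_p:.1%}")  # reduction: 82.9%
```

The saving grows with the hidden size, which is why the factorization matters most for the large model variants.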

Pre-training BioALBERT
We first initialized BioALBERT with weights from ALBERT during the training phase. Biomedical terminology contains terms that can mean different things depending on their context of appearance. For example, ER could refer to the 'estrogen receptor' gene or to its product as a protein. Similarly, RA may represent 'right atrium' or 'rheumatoid arthritis' depending on the context. On the other hand, two terms can refer to the same concept, such as 'heart attack' and 'myocardial infarction'. As a result, pre-trained LMs trained on general corpora often perform poorly on biomedical text. BioALBERT is the first domain-specific LM trained on both biomedical corpora and clinical notes. The text corpora used for pre-training BioALBERT are listed in Table 2. BioALBERT is trained on abstracts from PubMed, full-text articles from PMC and clinical notes (MIMIC-III), and their combinations. These unstructured, raw corpora were converted into a structured format in which: (i) within a text, all blank lines were removed so that the text formed a single paragraph, and (ii) any line shorter than 20 characters was excluded. Overall, PubMed contained approximately 4.5 billion words, PMC about 13.5 billion words, and MIMIC-III 0.5 billion words.
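The two corpus-cleaning rules above can be sketched as follows (an illustrative reconstruction; the function name and exact handling are our assumptions, not the released pre-processing code):

```python
def clean_corpus(raw_text, min_len=20):
    """Apply the two pre-processing rules described in the text:
    (i) remove blank lines so each text becomes a single paragraph,
    (ii) exclude any line shorter than `min_len` characters."""
    lines = (ln.strip() for ln in raw_text.splitlines())
    return [ln for ln in lines if ln and len(ln) >= min_len]

sample = "Short line\n\nThis sentence is long enough to keep for pre-training.\n"
print(clean_corpus(sample))
# ['This sentence is long enough to keep for pre-training.']
```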
We used SentencePiece embeddings for tokenization of BioALBERT by pre-processing the data as sentence text. Each line was treated as a sentence and trimmed to a maximum length of 512 words; if a sentence was shorter than 512 words, more words were embedded from the next line. An empty line was used to delimit a new document. 3,125 warm-up steps were used for training all of our models. We used the LAMB optimizer during training and kept the vocabulary size at 30K. GeLU activation was used in all model variants. For BioALBERT base models, a training batch size of 1,024 was used, whereas for BioALBERT large models the batch size was reduced to 256 due to limited computational resources. A summary of the parameters used in pre-training is given in Table 3.

Fine-tuning BioALBERT

Like other SOTA biomedical LMs, BioALBERT was tested on a number of downstream BioNLP tasks that required minimal architecture alteration. BioALBERT's computational requirements were not significantly larger than those of the baseline models, and fine-tuning required relatively little computation compared to pre-training. BioALBERT uses less physical memory, improves on parameter-sharing techniques, and learns word embeddings using SentencePiece tokenization, which gives it better performance and faster training than other SOTA biomedical LMs. During fine-tuning, we initialized with the weights of the pre-trained BioALBERT LM. We used the AdamW optimizer with a learning rate of 0.00001 and a batch size of 32. We restricted sentence length to 512 for NER and 128 for all other tasks, and lower-cased all words. Finally, we fine-tuned our pre-trained models for 10k training steps with 512 warm-up steps. The test datasets were used for prediction, and the evaluation metrics were compared with previous SOTA models. Table 5 summarizes all fine-tuning parameters.
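The sentence-packing scheme used during pre-processing above (sequences capped at 512 words, with shorter sentences topped up from the following line) can be sketched roughly as below; this is our simplified reconstruction operating on whitespace-split words rather than SentencePiece tokens:

```python
def pack_sequences(lines, max_words=512):
    """Greedily pack consecutive lines into sequences of at most
    `max_words` words: longer lines are trimmed and shorter ones are
    topped up with words drawn from the following line(s)."""
    words = [w for line in lines for w in line.split()]
    return [words[i:i + max_words] for i in range(0, len(words), max_words)]

print(pack_sequences(["a b c", "d e f g"], max_words=5))
# [['a', 'b', 'c', 'd', 'e'], ['f', 'g']]
```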

Tasks and Datasets
We fine-tuned BioALBERT on 6 different BioNLP tasks with 20 datasets that cover a broad range of data quantities and difficulties (Table 6). We rely on pre-existing datasets that are widely supported in the BioNLP community and describe each of these tasks and datasets below.
• Named entity recognition (NER): Recognizing domain-specific proper nouns in a biomedical corpus is the most basic and important BioNLP task. F1-score was used as the evaluation metric. BioALBERT was evaluated on 8 NER benchmark datasets from the biomedical and clinical domains: NCBI (Disease) 21 , BC5CDR (Disease) 18 , BC5CDR (Chemical) 18 , BC2GM 23 , JNLPBA 19 , LINNAEUS 20 , Species-800 (S800) 22 and Share/Clefe 17 .
• Relation extraction (RE): RE aims to identify relationships among entities in a sentence. The annotated data were compared with the relationship and the types of entities. Micro-average F1-score was used as the evaluation metric. For RE, we used the DDI 24 , Euadr 26 , GAD 27 , ChemProt 7 and i2b2 25 datasets.
• Sentence similarity (STS): STS tasks predict a similarity score for a pair of sentences. Pearson correlation was used as the evaluation metric. For STS, we used the BIOSSES and MedSTS datasets.
• Document classification: Document classification assigns a whole document to one or more categories; in the multi-label setting, multiple labels are predicted from the text. Following common practice, we report the F1-score. For document classification, we used the HoC (Hallmarks of Cancer) 31 dataset.
• Inference: Inference tasks predict whether a premise sentence entails a hypothesis sentence, focusing mainly on causal relationships between sentences. We used standard overall accuracy as the metric. For inference, we used the MedNLI 30 dataset.
• Question answering (QA): QA is the task of answering questions posed in natural language given related passages. We used accuracy as the evaluation metric. For QA, we used the BioASQ factoid 32 datasets.
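For concreteness, the entity-level F1-score used as the NER metric above can be computed as in the sketch below (a generic implementation, not the exact evaluation script used by the benchmarks):

```python
def entity_f1(gold, pred):
    """Entity-level F1: entities are (start, end, type) spans and a
    predicted span counts only on an exact match with a gold span."""
    if not gold or not pred:
        return 0.0
    tp = len(gold & pred)                 # exact-match true positives
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 2, "Disease"), (5, 7, "Chemical")}
pred = {(0, 2, "Disease"), (5, 6, "Chemical")}  # one boundary error
print(entity_f1(gold, pred))  # 0.5
```

The micro-average F1 used for RE aggregates true positives, false positives and false negatives over all classes before computing the same ratio.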

Results and Discussion
• Comparison with SOTA biomedical LMs: Table 7 summarizes the results for all BioALBERT variants in comparison to the baselines. We observe that BioALBERT performs better than SOTA models on 17 out of 20 datasets. For NER, BioALBERT achieved the largest gain, with an average improvement of 11.09% in BLURB score over the SOTA models. For RE, BioALBERT outperformed SOTA methods on 3 out of 5 datasets, by 1.69%, 0.82% and 0.46% on the DDI, ChemProt and i2b2 datasets, respectively. On average (micro), BioALBERT obtained an F1-score (BLURB score) 0.80% higher than the SOTA LMs. For Euadr and GAD, the performance of BioALBERT drops slightly because different data splits were used: we used the official split provided by the authors, whereas the SOTA method reported 10-fold cross-validation results, and having more folds typically increases results.
For STS, BioALBERT achieved higher performance on both datasets by a 1.05% increase in average Pearson score (BLURB score) as compared to SOTA models.In particular, BioALBERT achieved improvements of 0.50% for BIOSSES and 0.90% for MedSTS.
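The Pearson score used for the sentence-similarity results above can be computed per dataset as follows (a plain-Python sketch of the standard formula applied to predicted and gold similarity scores):

```python
import math

def pearson(xs, ys):
    """Pearson correlation between predicted (xs) and gold (ys) scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(round(pearson([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]), 3))  # 0.991
```

The BLURB-style aggregate for STS is then simply the mean Pearson score over the BIOSSES and MedSTS test sets.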
Similarly, for document classification, BioALBERT slightly increased performance, by 0.62% on the HoC dataset. For the inference task (MedNLI dataset), the performance of BioALBERT drops slightly; we attribute this to the average sentence length being smaller than in the other datasets.
For QA, BioALBERT achieved higher performance on all 3 datasets and increased the average accuracy (lenient) score (BLURB score) by 2.83% compared to SOTA models. In particular, BioALBERT improves performance by 1.08% on BioASQ 4b, 2.31% on BioASQ 5b and 5.11% on BioASQ 6b, respectively, compared to SOTA.
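The lenient accuracy metric used for QA above counts a question as correct if any returned candidate matches a gold answer; a minimal sketch (the function name and case-insensitive matching rule are our assumptions):

```python
def lenient_accuracy(predictions, golds):
    """predictions: one list of candidate answers per question;
    golds: one set of acceptable gold answers per question.
    A question scores 1 if any candidate matches a gold answer."""
    hits = sum(
        1
        for cands, gold in zip(predictions, golds)
        if any(c.lower() in {g.lower() for g in gold} for c in cands)
    )
    return hits / len(golds)

preds = [["p27", "p53"], ["aspirin"]]
golds = [{"p27"}, {"ibuprofen"}]
print(lenient_accuracy(preds, golds))  # 0.5
```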
We note that the performance of BioALBERT (both base and large) pre-trained on MIMIC-III, in addition to PubMed or to the combination of PubMed and PMC, drops compared with the same models pre-trained without MIMIC-III, especially on the RE, STS and QA tasks. Since MIMIC-III consists of notes from the ICU of Beth Israel Deaconess Medical Center (BIDMC) only, its data size is relatively small compared to PubMed and PMC, and we suggest this is why these models underperformed models trained on PubMed only or on PubMed and PMC. BioALBERT (large), trained on PubMed with a dup-factor of five, performed better.

We compared the pre-training run-time statistics of BioALBERT with those of BioBERT, and all variants of BioALBERT outperformed BioBERT. The difference in training time is substantial, identifying BioALBERT as a robust and practical model. BioBERT Base trained on PubMed took 10 days, and BioBERT Base trained on PubMed and PMC took 23 days, whereas every BioALBERT model took less than 5 days to train for an equal number of steps. The run-time statistics of both pre-trained models are given in Table 8.

Conclusion
We present BioALBERT, the first adaptation of ALBERT trained on both biomedical text and clinical data. Our experiments show that training general-domain language models on domain-specific corpora leads to an improvement in performance across a range of BioNLP tasks. BioALBERT outperforms previous state-of-the-art models on 5 out of 6 benchmark tasks and on 17 out of 20 benchmark datasets. We expect that the release of the BioALBERT models and data will support the development of new biomedical NLP applications.

Figure 1 .
Figure 1. An overview of pre-training, fine-tuning and the diverse tasks and datasets present in benchmarking for BioNLP using BioALBERT

Table 2 .
List of text corpora used for BioALBERT

Table 3 .
Summary of parameters used in the Pre-training of BioALBERT

We pre-trained 8 variants of BioALBERT (Table 4), consisting of 4 base and 4 large LMs. We found that on a V3-8 TPU, both base and large LMs could be trained successfully with a large batch size. The base models use an embedding size of 128 and have 12 million parameters, whereas the large models use an embedding size of 256 and have 16 million parameters.

Table 4 .
BioALBERT trained on different training steps, different combinations of the text corpora given in Table 2, and BioALBERT model version and size

Table 5 .
Summary of parameters used in fine-tuning

Table 6 .
Statistics of the datasets used

Table 7 .
Comparison of BioALBERT vs. SOTA methods in BioNLP tasks

Table 8 .
Comparison of run-time (in days) statistics of BioALBERT vs. BioBERT. Refer to Table 4 for more details of BioALBERT sizes. BioBERT Base1 and BioBERT Base2 refer to BioBERT trained on PubMed and on PubMed+PMC, respectively

Table 9 .
Prediction samples from ALBERT and BioALBERT; bold entities (in the table) are better recognised by BioALBERT.
ALBERT: ". . . receptors in lymphocytes and their sensitivity to . . ."
BioALBERT: "Number of glucocorticoid receptors in lymphocytes and their sensitivity to . . ."
ALBERT: ". . . leaflets are mildly thickened. There is mild mitral annular calcification. TRICUSPID VALVE . . ."
BioALBERT: "The mitral valve leaflets are mildly thickened. There is mild mitral annular calcification. TRICUSPID VALVE . . ."