Exploring Deep Learning for recognizing rare diseases and their clinical manifestations from texts

Although rare diseases are individually characterized by low prevalence, approximately 300 million people worldwide are affected by one. The early and accurate diagnosis of these conditions is a major challenge for general practitioners, who often lack the knowledge required to identify them. In addition, rare diseases usually show a wide variety of manifestations, which can make diagnosis even more difficult. A delayed diagnosis can negatively affect the patient's life. Therefore, there is an urgent need to increase scientific and medical knowledge about rare diseases. Natural Language Processing (NLP) and deep learning can help to extract relevant information about rare diseases to facilitate their diagnosis and treatment. This paper explores several deep learning techniques, such as Bidirectional Long Short-Term Memory (BiLSTM) networks and deep contextualized word representations based on Bidirectional Encoder Representations from Transformers (BERT), to recognize rare diseases and their clinical manifestations (signs and symptoms) in the RareDis corpus. This corpus contains more than 5,000 rare diseases and almost 6,000 clinical manifestations. BioBERT, a domain-specific language representation based on BERT and trained on biomedical corpora, obtains the best results. In particular, this model achieves an F1-score of 85.2% for rare diseases, outperforming all the other models.


I. INTRODUCTION
Rare diseases are characterized by a low prevalence in the population. There is no consensus on the prevalence threshold below which a disease is considered rare. Thus, whereas in the United States a rare disease is one that affects fewer than 200,000 people, in Europe it is one that affects fewer than 1 person in 2,000 [1]. To date, there are around 7,000 known rare diseases, and new ones are identified each week. In spite of their low prevalence, these diseases may affect more than 400 million people around the world [2], [3].
For patients and their families, the diagnostic process of a rare disease becomes a very long road to an accurate diagnosis and an adequate treatment. The delay in diagnosis of rare diseases is between six and seven years [4]. A possible cause of this delay is clinicians' limited experience and knowledge of rare diseases [5]-[7]. In addition, rare diseases may present a heterogeneous phenotype, with a wide variety of symptoms and signs related, among other factors, to different driving mutations [8]. These characteristics are often responsible for inaccuracies in the diagnosis of rare diseases. Therefore, there is an urgent need to increase the usability of the sparse and fragmented scientific and medical knowledge about rare diseases [9].
Artificial intelligence, and in particular Natural Language Processing (NLP) and machine learning, can play a beneficial role by providing better access to relevant information about rare diseases and their clinical manifestations (signs and symptoms), and in this way help to alleviate the workload of doctors. Although much of the knowledge about rare diseases is stored in databases and ontologies, biomedical literature (research articles, clinical cases, health forums, social media, etc.) is a rich source of information about rare diseases in unstructured text. Information extraction techniques such as Named Entity Recognition (NER) can help structure this information, facilitating access to the knowledge embedded within those texts and boosting scientific research.
The automatic recognition of disease named entities has attracted much attention in recent years [10]-[16], as it can be applied in meaningful clinical applications such as cohort selection for clinical trials or epidemiological studies, pharmacovigilance, and personalized medicine, among many others. The task is very challenging due to the diversity and complexity of disease names. Many diseases can be represented by different synonyms and abbreviations. For instance, "obsessive-compulsive disorder", "obsessive compulsive disorder", "anancastic neurosis", and "OCD" all refer to the same disease. Moreover, disease names usually contain modifiers that can refer to body parts or degrees of disease (e.g., "periodic limb movement disorder" or "advanced sleep phase syndrome"). The recognition of symptoms and signs presents additional challenges. Many symptoms and signs can be described by technical terms (e.g., "dysuria"), but also by short phrases (such as "pain or discomfort when you urinate"). Furthermore, other NER challenges, such as overlapping, nested, and discontinuous entities, have received limited attention [17].
Recent advances in deep learning models have enabled great progress in NLP. In particular, Transformers [18] and Bidirectional Encoder Representations from Transformers (BERT) [19] have outperformed traditional and earlier deep learning models in most NLP applications [20]-[23], and in particular in NER for the biomedical domain [15], [24].
We briefly describe the most recent deep learning approaches for recognizing diseases in biomedical texts. One of the first studies to apply deep learning to this task is described in [10]. The authors proposed a hybrid system composed of two modules: a Conditional Random Field (CRF) [25] trained with orthographic, morphological, and domain features from UMLS, and a bidirectional recurrent neural network (RNN) initialized with domain-specific word embeddings. Finally, a Support Vector Machine (SVM) classifier combines the outputs of the two previous modules. To train and test the system, the authors used the dataset of the Disease Named Entity Recognition and Normalization (DNER) shared task [26] of the BioCreative V challenge, which consists of 1,500 PubMed abstracts and a total of 12,850 disease mentions. The CRF achieves better results (F1=82.88%) than the bidirectional RNN (F1=78.27%). The output fusion by SVM obtains the best performance, with an F1 of 84.28%.
In recent years, the Bidirectional Long Short-Term Memory (BiLSTM) [27] network with a CRF layer has proved to be the most successful model for biomedical NER [11], [28], [29]. The approach proposed by Habibi et al. [11] was one of the first works to exploit pre-trained word embeddings to initialize a BiLSTM+CRF network for recognizing diseases. The authors used two pre-trained embedding models created by Pyysalo et al. [30]. The first model (from now on called PubMed-PMC) was trained on a collection of texts formed by all abstracts from PubMed (more than 23 million abstracts) and all full articles from PMC (an open-access database with more than 700,000 full articles from the biomedical domain). The second embedding model (from now on called Wiki-PubMed-PMC) extended the first one by adding approximately four million English articles from Wikipedia. These models were trained using the word2vec tool [31]. The authors also trained a word embedding model on a collection of 20,000 European patents. To train and evaluate their models, they used the NCBI corpus [32] and the CDR corpus [33]. The NCBI corpus is a collection of 793 PubMed abstracts with a total of 6,892 disease mentions. The CDR corpus contains 1,500 MEDLINE abstracts annotated with diseases, chemicals, and their relations. The experiments showed that the network initialized with Wiki-PubMed-PMC obtains better performance (with an F1 of 90.4% on the NCBI dataset and 88.17% on the CDR dataset) than those initialized with the other pre-trained models. This may be because the Wiki-PubMed-PMC model was trained on a larger collection of texts than the other pre-trained models, and this collection contained both domain-specific and general texts.
The SBLC model [12] is also based on a BiLSTM network with a CRF layer. To represent the texts, the authors trained a word embedding model on a large collection of texts gathered from PubMed, PMC, and Wikipedia, with a total of 5.5 billion words. The SBLC was trained and tested on the NCBI dataset, obtaining an F1 of 86.2%.
Instead of an RNN, Zhao et al. [13] used a deep convolutional neural network (CNN). In addition to word embeddings, the authors also exploited character embeddings and lexicon feature embeddings to represent the texts. The character embeddings were generated by a CNN layer. The MEDIC vocabulary [34], composed of more than 67,000 disease mentions, was used to create the lexicon feature embeddings. After the embedding layer, where each word is represented by the concatenation of its three embeddings, several CNN layers are applied to obtain higher-level features. Then, instead of a CRF classifier, a multiple label strategy (MLS) is applied to capture the labels of the context words. This strategy uses a softmax function to obtain the probability of each possible label. The system obtained an F1 of 85.17% on the NCBI corpus and an F1 of 87.83% on the CDR corpus.
Ling et al. [14] also used an architecture composed of a BiLSTM with a CRF layer. This architecture was initialized with the three types of embeddings proposed by Zhao et al. [13], as described above. The main difference is that these authors applied a combination of a CNN and an LSTM to generate the character embeddings, instead of a CNN alone. The final model achieved an F1 of 83.8% on the NCBI dataset.
One of the main drawbacks of pre-trained word embedding models is that they provide only one vector per word, so they cannot handle polysemous words. Recently, contextualized word representation models (such as ELMo [35], GPT-2 [36], or BERT [19]) have emerged as an alternative to non-contextual word embedding models, providing a different vector for each sense of a word. Lee and colleagues [15] applied BERT to the task of disease recognition on the NCBI dataset, achieving an F1 of 88.60%. The authors also trained their own language representation model (BioBERT) on two large biomedical corpora, PubMed and PMC. BioBERT slightly outperforms BERT on the NCBI dataset, with an improvement of 0.62%.
Li et al. [16] also trained a BERT model, using 1.5 million electronic health record notes. This model was evaluated on the NCBI and CDR datasets, showing an F1 of 89.92% and 93.82%, respectively.
Very few research efforts have focused on the extraction of rare diseases. The RDD corpus [37] contains 1,000 MedLine abstracts covering 578 rare diseases, with 3,678 annotations expressing a disability. A disability can be defined as "any restriction or lack (resulting from an impairment) of ability to perform an activity in the manner or within the range considered normal for a human being" [38]. The authors evaluated a model based on BiLSTM and CRF to extract rare diseases and disabilities, achieving an F1-score of 70.1% for rare diseases and 81% for disabilities.
In this paper, we address the task of recognizing rare diseases as well as their clinical manifestations (symptoms and signs). Moreover, to the best of our knowledge, this is the first work to explore three BERT-based models to extract rare diseases from texts. In particular, we use the basic BERT model and two domain-specific models, BioBERT [15] and ClinicalBERT [39], which were trained on biomedical and clinical texts, respectively. In order to provide a comprehensive comparison, we also study several BiLSTM models initialized with different pre-trained word embedding models.

II. MATERIALS AND METHODS

A. DATASET
We use the RareDis corpus [40], a collection of texts from the Rare Disease Database (NORD)¹. These texts were manually annotated with four entity types (diseases, rare diseases, signs, and symptoms). The corpus also includes relations between entities, but these are outside the scope of this work. The corpus has three splits: training, validation, and test sets. Table 1 shows the number of annotations of each entity type, as well as the number of documents, sentences, and tokens in each set. A more detailed description of the RareDis corpus can be found in [40]. The corpus contains a total of 13,772 entities. We can observe that the sign and rare disease entity types are the most prevalent in the three sets, at around 41% and 34%, respectively. The disease entity type is the third-largest type, with approximately 17%, while the symptom entity type is the most sparse in the three sets. The corpus is distributed in Brat standoff format [41]. We parse the texts using spaCy to obtain their tokens, lemmas, and PoS tags. As NER is a sequence labeling task, we represent each token using the standard IOB2 (Inside, Outside, Beginning) encoding scheme [42], where B-X identifies the first token of an entity mention of type X (for example, B-SIGN), I-X identifies the continuation of an entity mention of type X (for example, I-SIGN), and O marks all other tokens. The RareDis corpus and its guidelines are publicly available to the research community².
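To make the IOB2 encoding concrete, the following minimal sketch (our own illustration; the example sentence and entity span are invented, not taken from the RareDis corpus) converts token-level entity spans into IOB2 tags:

```python
def to_iob2(tokens, spans):
    """Convert entity spans (start, end, type), given as token indices
    with an exclusive end, into an IOB2 tag sequence."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"      # first token of the mention
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"      # continuation tokens
    return tags

tokens = ["Patients", "develop", "muscle", "weakness", "."]
# One SIGN mention covering tokens 2-3 ("muscle weakness")
print(to_iob2(tokens, [(2, 4, "SIGN")]))
# -> ['O', 'O', 'B-SIGN', 'I-SIGN', 'O']
```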

B. APPROACHES
We now describe the different methods used to tackle the NER task on the RareDis corpus. As a baseline system for comparison, we use a CRF, one of the most successful algorithms for sequence labeling tasks such as NER [43], [44]. Moreover, we explore different deep learning architectures, namely BiLSTM and Transformers.
For much of the past decade, RNNs, and in particular BiLSTMs, have been successfully applied to a wide range of NLP tasks [35], [45]. However, these networks have some drawbacks, such as difficulty processing long sequences and high computational complexity [46]. In contrast, Transformers are based on the attention mechanism [47], which captures the relevant information of the input sequence and also allows parallelization (they can process every word of the input sequence in parallel, while RNNs have to process word by word). Moreover, Transformers have outperformed RNNs, setting the new state-of-the-art performance in many NLP tasks [48].
Thus, we aim to provide a comparative analysis of deep learning models for detecting rare diseases and their clinical manifestations in texts.

1) CRF
CRF learns the correlations between labels and outputs the sequence of IOB tags with the highest probability. As the feature set, we consider three kinds of features: token, lemma, and PoS tag. For each token, we select a window of size two; the features of the tokens within this window form the representation of that token. These features are fed into the CRF classifier, which predicts an IOB tag for each input token. To implement the model, we use the CRFSuite package [49]. The classifier was trained on both the training and validation sets, since we use default hyperparameters. The Limited-memory BFGS algorithm (L-BFGS) is used as the optimization method.
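As an illustration of this feature scheme, the sketch below builds the window-based feature dictionaries. The feature names and the (token, lemma, PoS) input format are our own assumptions; in practice, dictionaries like these would be passed to the `sklearn_crfsuite.CRF` classifier with `algorithm="lbfgs"`.

```python
def token_features(sent, i, window=2):
    """Build the feature dict for token i from a window of +/- `window`
    positions, each represented by its token, lemma, and PoS tag."""
    features = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(sent):
            token, lemma, pos = sent[j]
            features[f"{offset}:token"] = token.lower()
            features[f"{offset}:lemma"] = lemma
            features[f"{offset}:pos"] = pos
        else:
            # Mark window positions that fall outside the sentence.
            features[f"{offset}:pad"] = True
    return features

# (token, lemma, PoS) triples as produced by a parser such as spaCy
sent = [("weak", "weak", "ADJ"), ("muscles", "muscle", "NOUN")]
feats = token_features(sent, 0)
print(feats["0:token"], feats["1:lemma"])  # weak muscle
```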
2) Bidirectional Long Short-Term Memory (BiLSTM)
BiLSTM has been successfully applied to the NER task in the biomedical domain [28], [50], and in particular to recognizing disease names [10]-[12], [14]. This model consists of a forward LSTM (which processes the input sequence from left to right) and a backward LSTM (which processes it from right to left). In this way, a BiLSTM can learn relevant information from both the previous and the following context of each input token, effectively increasing the amount of information available to the network [51].
Our architecture consists of several layers, described below. First, in the input layer, the texts are represented as word vectors. These input vectors are then passed to the BiLSTM layer described above. The output vector of the BiLSTM layer is the concatenation of the outputs of the forward and backward LSTMs. After the BiLSTM layer, we consider two different strategies for the output layer. The first strategy uses a CRF classifier as the last layer, which outputs the sequence of IOB tags with the maximum probability for the input sequence. The CRF layer takes as input the label probabilities for each word coming from the output layer of the BiLSTM. Thus, the context surrounding the label assignments predicted by the BiLSTM is also taken into account: a linear-chain CRF explicitly models dependencies between the labels through a transition matrix with transition scores between all pairs of labels. This makes it easy to learn constraints such as, for example, that an "I-RAREDISEASE" tag cannot follow an "O" tag. These constraints are captured by the CRF layer simply by considering each token's time step. As a second strategy, we also evaluate a BiLSTM without a CRF layer, where each label probability is treated as conditionally independent. To do this, instead of a CRF, we employ a TimeDistributed dense layer, which applies the same dense layer to every time step of the BiLSTM output.
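The kind of transition constraint the CRF layer learns can be made explicit with a small helper (our own illustrative check, not part of the trained model): under IOB2, an I-X tag is only valid immediately after a B-X or I-X tag of the same type.

```python
def valid_transition(prev_tag, tag):
    """Return True if `tag` may follow `prev_tag` under the IOB2 scheme:
    I-X is only valid right after B-X or I-X of the same entity type."""
    if not tag.startswith("I-"):
        return True  # O and B-X tags may follow anything
    etype = tag[2:]
    return prev_tag in (f"B-{etype}", f"I-{etype}")

print(valid_transition("O", "I-RAREDISEASE"))             # False
print(valid_transition("B-RAREDISEASE", "I-RAREDISEASE")) # True
print(valid_transition("I-SIGN", "I-RAREDISEASE"))        # False
```

A trained linear-chain CRF does not hard-code this rule; it learns very low transition scores for invalid pairs, which has the same practical effect.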
Moreover, we explore the effect of the input text representation on the performance of the BiLSTM. Texts must be encoded as vectors of real numbers to be used as input for machine learning and deep learning models. In the case of neural networks, it is possible to create a random vector for each input token; during training, the network adjusts these word vectors alongside its other weights. An alternative is to represent tokens with word vectors (word embeddings) from a language model. In the last decade, neural network language models [52], [53] have effectively replaced traditional models such as Bag-of-Words, achieving state-of-the-art results in many NLP tasks. Several studies have shown that word embeddings trained with neural networks can capture semantic and syntactic relations between tokens [31], thus providing an accurate meaning representation of the input tokens. The most popular word embedding models are Word2Vec [31], GloVe [54], and fastText [55]. In this work, we study the effect of different pre-trained word embeddings on the BiLSTM performance.
In particular, we explore three different models:
• GoogleNews [56], a pre-trained word embedding model trained with the Word2Vec network on the GoogleNews dataset. The model contains word embeddings of dimension 300 for 3 million words.
• GloVe [54], a pre-trained word embedding model trained on general-domain texts, which has the largest vocabulary of the three models.
• Wiki-PubMed-PMC [30], the model described in Section I, trained on PubMed abstracts, PMC full-text articles, and English Wikipedia.
To implement and train the models, we use the Keras Python API [58] with TensorFlow as the backend. We use the Adam optimizer [59] with a learning rate of 0.001 and categorical cross-entropy as the loss function. To avoid overfitting, we use early stopping with a patience of four, meaning that training finishes if the loss does not improve in four consecutive epochs.
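The way pre-trained embeddings initialize the input layer can be sketched as follows (a minimal pure-Python illustration; the function name and the initialization range are our own assumptions): each vocabulary token is mapped to its pre-trained vector, while out-of-vocabulary tokens receive small random vectors that the network adjusts during training.

```python
import random

def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Map each vocabulary token to its pre-trained vector; tokens missing
    from the pre-trained model get a small random vector instead."""
    rng = random.Random(seed)
    matrix = []
    for token in vocab:
        if token in pretrained:
            matrix.append(pretrained[token])
        else:
            matrix.append([rng.uniform(-0.05, 0.05) for _ in range(dim)])
    return matrix

# Toy pre-trained model with 3-dimensional vectors
pretrained = {"disease": [0.1, 0.2, 0.3]}
matrix = build_embedding_matrix(["disease", "raredisword"], pretrained, dim=3)
print(matrix[0])       # [0.1, 0.2, 0.3]
print(len(matrix[1]))  # 3
```

In a Keras model, a matrix like this would be passed as the initial weights of the `Embedding` layer.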

3) Bidirectional Encoder Representations from Transformers (BERT)
Deep contextualized language models are capable of capturing word meanings and their most representative relations with other words. Thanks to this accurate linguistic representation, these models have achieved unprecedented results on many NLP tasks [19]. Moreover, contextualized language models are trained through unsupervised learning, requiring only a plain text corpus. Thus, these models can partially alleviate the shortage of large annotated corpora, which are essential for supervised machine learning algorithms.
Without a doubt, BERT, which stands for Bidirectional Encoder Representations from Transformers, is the most popular contextualized language model due to its excellent results in many NLP applications [19]. Transformers are based on the attention mechanism [18], which represents each word in a sentence based on the tokens most relevant to it. Attention mechanisms present two major advantages over RNNs: first, they can handle long-term dependencies between any two tokens in a sentence, and second, they enable the parallelization of training.
The basic idea of BERT is that the model is trained to predict words from their contexts in an unsupervised way. This prediction only requires a large collection of texts and some strategy to mask the words to be predicted. In this way, BERT learns meaningful representations for the words in a sequence. The architecture of BERT (which consists of 12 encoder layers in the BERT-base version or 24 encoder layers in the BERT-large variant) can be extended with additional layers to solve a specific NLP task. This process is known as fine-tuning. In our case, we have used the BertForTokenClassification class provided by the PyTorch-Transformers package³, a library of state-of-the-art pre-trained models for NLP. It provides PyTorch implementations and pre-trained model weights for the most popular deep contextualized language models. The BertForTokenClassification class implements a fine-tuning model that adds a token-level classifier on top of the BERT model. The token-level classifier is a linear layer that takes as input the last hidden state of the sequence. The BertForTokenClassification class allows loading different pre-trained models. In this work, we explore the following ones:
• Bert-base-uncased, the version of the original BERT proposed in [19]. This version is a stack of 12 encoders, each with 12 attention heads. For each token of the input sentence, the output layer provides an embedding of dimension 768. The total number of parameters is 110 million. The model was trained on two corpora: BookCorpus, with around 800 million words, and English Wikipedia, with around 2,500 million words.
• BioBERT [15], whose weights were initialized with the BERT weights and which was then pre-trained on two biomedical corpora: PubMed abstracts (4,500 million words) and PMC full-text articles (13,500 million words).
• ClinicalBERT [39], which was pre-trained on clinical texts.
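One practical detail when fine-tuning BERT for token classification is that its WordPiece tokenizer splits words into subwords, so word-level IOB2 tags must be aligned with the subword sequence. A common strategy, sketched below as our own illustration (the "##" prefix is WordPiece's convention for continuation pieces), is to keep the tag on the first piece of each word and give continuation pieces the corresponding I- tag:

```python
def align_labels(word_pieces, word_tags):
    """Align word-level IOB2 tags with WordPiece subwords: the first piece
    of a word keeps the word's tag; continuation pieces (prefixed '##')
    inherit the corresponding I- tag (or O)."""
    aligned = []
    word_idx = -1
    for piece in word_pieces:
        if piece.startswith("##"):
            tag = word_tags[word_idx]
            aligned.append("I-" + tag[2:] if tag != "O" else "O")
        else:
            word_idx += 1
            aligned.append(word_tags[word_idx])
    return aligned

# "dysuria" is split into three pieces; "occurs" stays whole
pieces = ["dys", "##ur", "##ia", "occurs"]
tags = ["B-SYMPTOM", "O"]
print(align_labels(pieces, tags))
# -> ['B-SYMPTOM', 'I-SYMPTOM', 'I-SYMPTOM', 'O']
```

Other strategies exist (for example, masking the loss on continuation pieces); this one simply propagates the entity label across all subwords of a word.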

III. RESULTS AND DISCUSSION
In this section, we present the results obtained by the different methods. To calculate the evaluation metrics (accuracy, recall, precision, and F1-score), the sklearn-crfsuite⁴ and seqeval [61] packages are used, providing results at the token and entity level, respectively.
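The difference between the two evaluation levels matters: entity-level scoring counts a prediction as correct only if both the boundaries and the type of the entity match the gold annotation, while token-level scoring evaluates each IOB tag independently. The sketch below is our own minimal reimplementation of entity-level matching in the style of seqeval, for illustration only:

```python
def extract_entities(tags):
    """Collect (start, end, type) entities from an IOB2 tag sequence.
    Invalid I- tags (with no matching B-/I- before them) are ignored."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last entity
        if tag.startswith("I-") and etype == tag[2:]:
            continue                        # entity continues
        if etype is not None:
            entities.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities

def entity_f1(gold, pred):
    """Entity-level F1: an entity counts only on an exact span+type match."""
    g, p = set(extract_entities(gold)), set(extract_entities(pred))
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

gold = ["B-SIGN", "I-SIGN", "O", "B-RAREDISEASE"]
pred = ["B-SIGN", "O", "O", "B-RAREDISEASE"]  # sign boundary is wrong
print(round(entity_f1(gold, pred), 2))  # 0.5: one of two entities matches
```

Note that at the token level the same prediction would score much higher (three of four tags are correct), which is why the two levels are reported separately.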
A. CRF
Table 2 shows the results achieved by the CRF at the entity level. Token-level results are shown in Table 6. The CRF achieves a micro-average F1-score of 64.8% and a macro-average F1-score of 61.9%. As the classes are unbalanced, we also consider the macro-weighted-average F1-score, which is 63.8%. The best results are obtained for the rare disease entity type (F1=82.4%), followed by symptom (F1=62.2%). In contrast, the sign entity type shows the lowest F1-score (45.5%), despite being the entity type with the largest number of instances (41%) in the training set (see Table 1). This may be because sign mentions are usually nominal phrases (for example, "malformations of the nipples"), unlike disease or rare disease names, which are usually combinations of a few technical terms (for example, "ADCY5-related dyskinesia"). In Table 6, the "Support" column shows the number of instances of each token type. The number of internal tokens (I-) for diseases and rare diseases is only slightly higher than the number of their initial tokens (B-), while the number of internal tokens for signs doubles the number of their initial tokens. In addition, many sign mentions are discontinuous entities, that is, they present gaps in their description. The sentence shown in Figure 1.c contains two signs: "malformations of the nipples" and "malformations of the abdominal wall", the latter being a discontinuous mention. Another possible reason is that many signs can also be considered diseases (see Figure 1.a). The CRF and the other models proposed in this study provide only one label per token; that is, they do not address the task of overlapping entities. Both signs and symptoms are clinical manifestations of diseases: a sign is objective evidence, while a symptom is a subjective experience that can only be identified by the patient. However, in contrast to the low results for signs, the CRF provides the second-best F1-score for the symptom type, which has the lowest number of instances (see Table 1). A manual review of symptom and sign mentions in the training set shows that most symptoms are described by technical terms (for example, "headache"), while signs usually have lay descriptions (for example, "dark circles under eyes"). It would be necessary to increase the number of symptoms in the RareDis corpus to study whether the difference between the results of both entity types is maintained.

B. BILSTM
All the BiLSTM models provide significantly lower results than the CRF (see Tables 2 and 3). The decrease is more than 20% in micro-average F1-score and 24% in macro-average F1-score. This may indicate that the training data is too small for deep learning. As with the CRF, the BiLSTM obtains its best results for rare diseases and its worst for signs.
Regarding the effect of the pre-trained word embeddings used to initialize the network, the BiLSTM with Wiki-PubMed-PMC provides the best overall results. It also obtains the best results for rare diseases and diseases. This may be because these word embeddings were trained on biomedical texts. The BiLSTM with GloVe achieves a slightly better F1-score for signs than the BiLSTM with Wiki-PubMed-PMC. Moreover, the BiLSTM with GloVe achieves an improvement of almost 6% in F1-score for symptoms over the BiLSTM with Wiki-PubMed-PMC. Although the GloVe word embeddings were not trained on biomedical texts, they obtain results very close to those obtained with Wiki-PubMed-PMC. This may be because GloVe has the biggest vocabulary size. On the other hand, random initialization and the GoogleNews word embeddings provide lower results.

C. BILSTM-CRF
Table 4 shows the results obtained by the BiLSTM-CRF models. In all of them, the CRF layer helps to outperform the same models without CRF, with improvements of around 10-15% over the overall BiLSTM scores. However, the BiLSTM-CRF models still provide lower overall results than the CRF baseline, with a decrease of 6% in micro-average F1-score.
The BiLSTM-CRF with Wiki-PubMed-PMC word embeddings achieves the best overall results. Moreover, this model also provides the best F1-scores for diseases and symptoms, while the BiLSTM-CRF with GloVe provides the best results for rare diseases and signs. The BiLSTM-CRF models initialized with random vectors or GoogleNews word embeddings show similar results; they do not outperform the BiLSTM-CRF with Wiki-PubMed-PMC or GloVe. As mentioned previously, the BiLSTM fails to beat the baseline, even when it includes a CRF classifier as its last layer. This may be because the training data is not large enough to train a deep learning model, while a CRF classifier trained with a simple feature set can handle the task. Regarding the pre-trained word embeddings, Wiki-PubMed-PMC and GloVe provide better performance than random initialization or GoogleNews word embeddings.

D. BERT-BASED MODELS
We have explored the use of three different deep contextualized word representations, all of them based on BERT (see Table 5). Unlike the BiLSTM models, these BERT-based models exceed the baseline results provided by a simple CRF classifier. BioBERT achieves the best micro-average and macro-weighted-average F1-scores, while the best macro-average F1-score is provided by ClinicalBERT. In general, BioBERT and ClinicalBERT show very close results. As with the previous models, rare diseases show the best results, followed by diseases. BioBERT obtains the best F1-scores for rare diseases and signs, while ClinicalBERT provides the best results for diseases and symptoms. As expected, the BERT base model obtains lower results than BioBERT and ClinicalBERT.

IV. CONCLUSION
Although rare diseases have a very low prevalence in the population, more than 400 million people worldwide (around 6% of the world's population) suffer from a rare disease. This number is continually growing, as five new rare diseases are described each week [62].
This work explores different approaches for recognizing rare diseases and their clinical manifestations. First, we propose a CRF baseline system using linguistic features. Second, we implement several BiLSTMs, exploring different strategies to initialize their input vectors: random initialization and three pre-trained word embedding models, one of which was trained on biomedical texts. Moreover, we explore three implementations of BERT, which differ in the type of texts used to pre-train the model. The RareDis corpus is used to train and evaluate the models. The experiments show that BioBERT obtains the best micro-average and macro-weighted-average F1-scores, with improvements of around 5% over the baseline results. The BiLSTM does not even outperform the baseline in terms of F1-score. Regarding the entity types, rare diseases show the highest F1-score (85.2%), while the other entity types do not exceed 60% in F1-score.
As future work, we plan to extend the size of the RareDis corpus by including MedLine abstracts and clinical cases of rare diseases. This could have a significant positive effect on the results, especially those achieved by the deep learning models. Moreover, it would let us determine whether the difference between symptoms and signs is due to their representations or to the number of instances available to train the models. We also plan to extend the corpus with texts written in languages other than English. We will also address some unsolved problems in NER, such as the recognition of nested, overlapping, and discontinuous entities.
Regarding the methods, we will study fine-tuning the BERT-based models with an additional CRF layer to improve the results for signs and symptoms. Furthermore, we plan to address the task of relation extraction on the RareDis corpus.

TOKEN-LEVEL RESULTS
This appendix contains the token-level results of the models proposed in this study.

FIGURE 1. Examples of entity types in the RareDis corpus.

TABLE 1. Statistics of the RareDis corpus.

TABLE 2. Entity-level results of the CRF. Best scores are in bold.

TABLE 3. Entity-level results of the BiLSTM models. Best scores are in bold.

TABLE 4. Entity-level results of the BiLSTM-CRF models. Best scores are in bold.

TABLE 5. Entity-level results of the BERT-based models. Best scores are in bold.