Exploring deep learning methods for recognizing rare diseases and their clinical manifestations from texts

Although rare diseases are characterized by low prevalence, approximately 400 million people are affected by a rare disease. The early and accurate diagnosis of these conditions is a major challenge for general practitioners, who do not have enough knowledge to identify them. In addition to this, rare diseases usually show a wide variety of manifestations, which might make the diagnosis even more difficult. A delayed diagnosis can negatively affect the patient’s life. Therefore, there is an urgent need to increase the scientific and medical knowledge about rare diseases. Natural Language Processing (NLP) and Deep Learning can help to extract relevant information about rare diseases to facilitate their diagnosis and treatments. The paper explores several deep learning techniques such as Bidirectional Long Short Term Memory (BiLSTM) networks or deep contextualized word representations based on Bidirectional Encoder Representations from Transformers (BERT) to recognize rare diseases and their clinical manifestations (signs and symptoms). BioBERT, a domain-specific language representation based on BERT and trained on biomedical corpora, obtains the best results with an F1 of 85.2% for rare diseases. Since many signs are usually described by complex noun phrases that involve the use of use of overlapped, nested and discontinuous entities, the model provides lower results with an F1 of 57.2%. While our results are promising, there is still much room for improvement, especially with respect to the identification of clinical manifestations (signs and symptoms).

This figure shows some annotated sentences in the RareDis corpus. Sentence (a) shows an example of two overlapping entities: sign and diseases. It also has a 'Produces' relationships between a rare diseases and a sign. Sentence (b1) contains an example of nested name entities belonging to different entity types: symptom and rare disease. b2 is a mention of rare diseases, which is muti-token. Sentence c contains several discontinuous mentions of signs "obsessive-compulsive disorder", "obsessive compulsive disorder", "anancastic neurosis", and "OCD" are the same disease. Moreover, disease names usually contain modifiers that can be related to body parts or degrees of disease (e.g., "periodic limb movement disorder" or "advanced sleep phase syndrome"). The recognition of symptoms and signs also presents additional challenges. Many symptoms and signs can be described by technical terms (e.g., "dysuria"), but also by short phrases (such as "pain or discomfort when you urinate"). Furthermore, other NER challenges such as overlapping, nested and discontinuous entities have received limited attention [19].
The recent advancements of deep learning models have facilitated great progress in NLP. Recently, transformers [20] and Bidirectional Encoder Representations from Transformers [21] have outperformed traditional and deep learning models for most of NLP applications [22][23][24][25], and in particular, for NER in the biomedical domain [17,26].
We briefly describe the most recent deep learning approaches for recognizing diseases in biomedical texts. One of the first studies that applied deep learning to this task is described in [12]. The authors proposed a hybrid system composed of two modules: a Conditional Random Field (CRF) [27] trained with orthographic, morphological, and domain features from Unified Medical Language System (UMLS) [28], and a bidirectional recurrent neural network (RNN) initialized with domain-specific word embeddings. Finally, a Support Vector Machine (SVM) classifier is used to combine the outputs of the two previous modules. For the training and testing of the system, the authors used the dataset of the Disease Named Entity Recognition and Normalization (DNER) shared task [29] of the BioCreative V challenge, which consists of 1500 PubMed abstracts and a total of 12,850 disease mentions. CRF achieves better results (F1=82.88%) than the bidirectional RNN (F1=78.27%). The output fusion by SVM obtains the best performance with an F1 of 84.28%.
In the last years, Bidirectional Long Short Term Memory (BiLSTM) [30] with CRF has proved to be the most successful model for the task of biomedical NER [13,31,32]. The approach proposed by Habibi et al. [13] was one of the first works to exploit pre-trained word embeddings to initialize a BiLSTM+CRF network for recognizing diseases. The authors used two pre-trained embedding models created by Pyysalo et al. [33]. The first model (from now on called PubMed-PMC) was trained using a collection of texts formed by all abstracts from PubMed (more than 23 million abstracts) and all full articles from PMC (a database of open access with more than 700,000 full articles from the biomedical domain). The second embedding model (from now on called Wiki-PubMed-PMC) was an extension of the first one by adding approximately four million English articles from Wikipedia. These models were trained using the word2vec tool [34]. The authors also trained a word embedding model by using a collection of 20,000 European patents. To train and evaluate their models, they use the NCBI corpus [35] and the CDR corpus [36]). The NCBI corpus is a collection of 793 PubMed abstracts and contains a total of 6892 disease mentions. The CDR corpus contains 1500 MEDLINE abstracts annotated with 5818 diseases, 4409 chemicals, and 3116 chemical-disease interactions. The experiments showed that the network initialized with Wiki-PubMed-PMC obtains better performance (with an F1 of 90.4% over the NCBI dataset and 88.17% over the CDR dataset) than those initialized with the other pre-trained models. This may be because the Wiki-PubMed-PMC model was trained on a larger collection of texts than the other pre-trained models. Moreover, this collection contained domain-specific and nonspecific texts.
The SBLC model [14], is also based on a BiLSTM network with a CRF layer. To represent the text, the authors trained a word embedding model by using a large collection of texts collected from PubMed, PMC, and Wikipedia, with a total of 5.5 billion words. The SBLC was trained and tested on the NCBI dataset, obtaining an F1 of 86.2%.
Instead of using RNN, Zhao et al. [15] used a deep convolutional neural network (CNN). In addition to word embeddings, the authors also exploited character embeddings and lexicon feature embeddings to represent the texts. The character embeddings were generated by using a CNN layer. The MEDIC vocabulary [37], composed of more than 67,000 disease mentions, was used to create the lexicon feature embeddings. After the embedding layer, where each word is represented by concatenating its three embeddings, several CNN layers are applied to obtain higher level features. Then, instead of a CRF classifier, a multiple label strategy (MLS) is applied to capture the labels of the context words. This strategy uses a softmax function to obtain the probability of each possible label. The system obtained an F1 of 85.17% on the NCBI corpus, and an F1 of 87.83% on the CDR corpus.
Ling et al. [16] also used an architecture composed of a BiLSTM with a CRF layer. This architecture was initialized by using the three type of embeddings proposed by Zhao et al. [15], as just described above. The main difference is that these authors applied a combination of a CNN and a LSTM to generate the character embeddings, instead of using a CNN network. The final model achieved an F1 of 83.8% on the NCBI dataset.
One of the main drawbacks of the pre-trained word embeddings models is that they only provide a vector for each word, so they do not handle polysemous words. Recently, contextualized word representation models (such as ELMo [38], GPT-2 [39] or BERT [21]) have emerged as an alternative to the non-contextual word embedding models, providing a different vector for each sense of a word. Lee and colleagues [17] applied BERT to the task of disease recognition on the NCBI dataset, achieving an F1 of 88.60%. The authors also trained their language representation model (BioBERT) on two large biomedical corpora such as PubMed and PMC. BioBERT slightly overcomes BERT on the NCBI dataset, with an improvement of 0.62%.
Li et al. [18] also trained a BERT model using 1.5 million electronic health record notes. This model was evaluated on the NCBI and CDR datasets, showing an F1 of 89.92% and 93.82% respectively.
Very few research efforts have focused on the extraction of rare diseases. The RDD corpus [40] contains 1000 MedLine abstracts covering 578 rare diseases and 3678 annotations expressing a disability. The authors analyzed a model based on Bi-LSTM and CRF to extract rare diseases and disabilities, achieving an F1 of 70.1% for rare diseases and 81% for disabilities.
In this paper, we address the task of recognizing rare diseases as well as their clinical manifestations (symptoms and signs). Moreover, to the best of our knowledge, this is the first work that explores three BERT-based models to extract rare diseases from texts. In particular, we use the basic BERT model and two models, BioBERT [17], and ClinicalBERT [41], which were trained using biomedical and clinical texts, respectively. In order to provide a comprehensive comparison, we also study several BiLSTM models initialized with different pre-trained word embedding models.

Dataset
We use the RareDis corpus [42], which is a collection of texts from the Rare Disease database (NORD) 1 . These texts were manually annotated with four entity types (diseases, rare diseases, signs, and symptoms). The corpus also includes relations between entities, but they are outside the scope of this work. The corpus has three different splits: training set, validation set, and test set. Table 1 shows the number of the entity types annotated, as well as the number of documents, sentences, and tokens in each split. A more detailed description of the RareDis corpus can be found in [42]. The corpus contains a total of 9318 entities. We can observe that sign and rare disease entity types are the most prevalent, around 41% and 34%, respectively. The disease entity type is the third-largest type, with approximately 17%, while symptom entity type is the most sparse entity type in the three splits.
The corpus is distributed in Brat standoff format [43]. The RareDis corpus and its guidelines are publicly available for the research community 2 .

Approaches
NER is a sequence labeling problem, where the goal of the model is to classify each token of the input sequence into the corresponding category. To define these categories, we must consider the types of entity that we intend to extract. As many entity mentions are multi-token, that is, they are composed of several words, for example, "ACDY5-related dyskinesia" (see Fig. 1), we must use a format that allows us to represent if a token belongs or not to a entity mention. Moreover, if the token does belong to an entity mention, we are interested in knowing if the token appears at the beginning of the mention or if it is an internal one of it. Typically, sequence labeling tasks use some of the variations of the format IOB encoding scheme [44], to represent the tokens. In our case, we represent each token using the standard IOB2 (Inside, Outside, Beginning) format [45], where B-X identifies the first token of an entity mention whose type is X (for example, B-SIGN), I-X identifies the continuation of an entity mention with type X (for example, I-SIGN), and O for other tokens. In this regard, the following nine categories or labels are used: O, B-Disease, I-Disease, B-RareDisease, I-RareDisease, B-Sign, I-Sign, B-Symptom, and I-Symptom. Thus, each of our proposed models should address this NER task as a multi-class classification problem and should produce one of these labels for each token in the input sequence. Figure 2 shows an example where an input sequence is processed by a BiLSTM network, where the last layer produces a label for each token in the input sequence. In this sequence, some tokens such as "outcome" or "primary" were classified with the label 'O' (no entities). We can also see that "KCS2", which is a rare disease mention formed by just one token, should be classified with the label B-RareDisease (we used the short label B-rare in the figure), while the following token, "is" should be labeled with "O". This figure also provides an example of multi-token entity, "short stature". The model should classify its first token as "B-sign", indicating that this token is at the beginning of the mention, while the second one with 'I-sign' , indicating the token is inside of the mention. Now, we describe the different methods used to deal with the task of NER on the Rare-Dise corpus.

Conditional random fields (CRF)
As a baseline method, Conditional Random Fields (CRF) [27] is proposed. This is one of the most successful algorithms for any sequence labeling task such as NER [46,47]. CRF learns the correlations between labels and provides the output sequence of IOB tags with the highest probability. That is, CRF predicts the most likely IOB tag for each token in the input sequence.
To represent each token, we consider three kinds of features: token, lemma, and PoS tag. We use Spacy [48], a very popular NLP library, to parse each input sequence and to obtain these features. For each token, we also select a window of size two. Then, the features (token, lemma and Pos tag) of the tokens belonging to this window form the feature set to represent each token. These features are fed into the CRF classifier, which predicts an IOB tag for each input token. To implement the model, we use the CRF-Suite package [49]. The classifier was trained using both training and validation datasets since we use default hyperparameters. The Limited Memory Algorithm for Bound Constrained Optimization (L-BFGS) is used as the optimization method.

Bidirectional long short-term memory (BiLSTM)
BiLSTM has been successfully applied to the NER task in the biomedical domain [31,50], and in particular, to recognize disease names [12][13][14]16]. This model consists of a forward LSTM (which sequentially processes the input sequence from left to right) and a backward LSTM (which processes the input sequences from right to left). In this way, BiLSTM can learn relevant information from the previous and next context for each input token, effectively increasing the amount of information available to the network [51].
Our architecture consists of several layers (see Fig. 2), which are described below. First, in the input layer, the text is represented as word vectors. Then, these input vectors are passed to the BiLSTM layer described above. The output vector of the BiLSTM layer is the concatenation of the forward LSTM and the backward LSTM. After the BiLSTM layer, we consider two different strategies for the output layer.
The first strategy (see Fig. 2) is using a time-distributed dense (TDD) layer to classify each token by determining its most likely label. This layer applies a dense layer on each times-tep of the the BiLSTM network. In this layer, each conditional probability is assessed independently of the other conditional probabilities.
The second strategy is using a CRF classifier as the last layer (see Fig. 3), which will output the sequence of IOB tags with the maximum probability for the input sequence. The CRF layer takes as input the label probability for each word coming from the output layer of the BiLSTM network. Thus, the context surrounding the label assignment predicted by the BiLSTM model is also added, whereby linear-chain CRF explicitly models dependencies between the labels through a transition matrix with transition scores between all pairs of the labels. This allows to easily learn constraints such as, for example, "I-RAREDISEASE" tag cannot follow an "O" tag. These types of constraints are captured by the CRF layer in a simple way by considering the time step in each token.
Moreover, we explore the effect of input text representation on the performance of BiLSTM. Texts must be encoded as vectors of real numbers to be used as input for machine learning and deep learning models. In the case of neural networks, it is possible to create a random vector for each input token. During the training, the network will adjust these word vectors alongside the other weights of the network. An alternative way is to represent tokens with word vectors (word embeddings) from a pre-trained language model. In the last decade, neural network language models [52,53] have effectively replaced traditional models such as the Bag-Of-Words, achieving state-of-the-art results in many NLP tasks. Several studies have shown that word embeddings trained with neural networks can capture semantic and syntactic between tokens [34], providing thus an accurate meaning representation of the input tokens. The most popular word embeddings models are Word2Vec [34], Glove [54] and fastText [55]. In this work, we study the effect of different pre-trained word embeddings on the BiLSTM performance. In particular, we explore three different models: To implement and train the BiLSTM models, we use the Keras Python API [58] with TensorFlow as the backend. We use an Adam optimizer [59] with a learning rate 0.001 and categorical cross-entropy as a loss function. To avoid overfitting, we use early stopping with the patience of four, meaning that training will finish if the loss function does not improve in four consecutive epochs.

Bidirectional encoder representations from transformers (BERT)
Deep contextualized language models are capable to capture word meanings and their more representative relations with other words. Thanks to this accurate linguistic representation, these models achieved unprecedented results on many NLP tasks [21]. Moreover, contextualized language models are trained through unsupervised learning, requiring only a plain text corpus. Thus, these models can partially alleviate the shortage of large annotated corpora, which are essential for supervised machine learning algorithms. Without a doubt, BERT, which stands for Bidirectional Encoder Representations from Transformers, is the most popular contextualized language model due to its excellent results in many NLP applications [21]. Transformers are based on attention mechanism [20], which attempts to represent each word in a sentence based on the most relevant tokens for that word. Attention mechanisms present two major advantages compared with Recurrent Neural Networks (RNNs): first, these mechanisms can handle long-term dependencies between any two tokens in a sentence, and second, they can enable the parallelization of training.
The basic idea of BERT is that the model is trained to predict words from their contexts in an unsupervised way. This prediction only requires a large collection of texts and some strategy to mask those words to be predicted. This strategy is known as , Masked Language Modeling (MLM). First, we tokenize the texts by using the Bert-Tokenizer class from Transformers library (provided by Hugging Face https:// huggi ngface. co/), which offers implementations and pre-trained model weights for the most popular transformers. This class has its own vocabulary with the mappings between words and their identifiers so it is not necessary to train a tokenizer on the RareDis corpus. Each sentence is tokenized and special tokens, such as CLS and SEP, are added at the beginning and at the end of each tokenized sequence, respectively. The tokens are padded or truncated based on the maximum length (512 tokens) that the BERT-base model can handle. For each token, this class also creates a position embedding that encodes the absolute position of the token in the input sequence. It is also necessary to create an attention mask in order to distinguish which tokens correspond to real words and which ones are padding tokens. Thus, the attention mask is composed of ones (indicating non-padding entries) and zeros (indicating padding entries). The input for BERT is the masked sequence and the sum of the token and position embeddings. Then, BERT should output a vector representation for each token.
The architecture of BERT (which consists of 12 encoder layers for the BERT-base version) can be extended with more layers capable to solve a specific NLP task. This process is known as fine-tuning. To fine-tuning the base model (see Fig. 4), we use the "BertForTokenClassification" class from Transformers library. This class implements a token-level classifier on top of the BERT model. The token-level classifier is a linear layer that takes as input the last hidden state of the sequence and makes predictions at the token level, rather than the sequence level. Figure 4 shows the output produced by this fine-tuning model for the input sequence "The primary outcome of KCS2 is short stature".
The BertForTokenClassification class allows to load different pre-trained models as its base architecture. In this work, we explore the following base architectures: • Bert-base-uncased version of the original BERT proposed in [21]. This version is a stack of 12 encoders, each having 12 attention heads. For each token of the input The total number of parameters is 110 million. The model was trained using two corpora: BookCorpus with around 800 million words and English Wikipedia with around 2500 million words. • BioBERT [17], whose weights were initialized using the BERT weights, and then, the model was pre-trained on two biomedical corpora: PubMed abstracts (4500 million words) and PMC full-text articles (13,500 million words). • ClinicalBERT [41] was trained with more than 2 million clinical notes from the MIMIC-III v1.4 database [60]. Its weights were initialized using the BioBERT weights.

Results
In this section, the results obtained from the different methods are presented. We evaluate them at entity level to know how well our models predict the whole entities (for example, "ACDY5-related dyskinesia"). As complementary information, we also assess our approaches at token level. This evaluation may give us some clues as to why some entities are more difficult to recognize and what kind of tokens are more challenging for the task. All our methods output a BIO tag for each token in the input sequence. These predicted BIO tags can be easily compared to the actual tags in the test dataset by using the sklearn library, which provides us the results at token level. That is, it calculates the scores for each label: O, B-Disease, I-Disease, B-RareDisease, I-RareDisease, B-Sign, I-Sign, B-Symptom, and I-Symptom. To evaluate the methods at entity level, we use the seqeval library [61].
NER approaches are typically evaluated in terms of recall, precision, and F1, which are calculated for each entity type or token type (BIO tags). Recall provides us how many of the predicted entities (or the predicted tokens) are correct. It can be defined as the ratio between the correctly predicted mentions for a given entity type (or token type) and the actual number of mentions of this entity type (or token type) in the test dataset. To obtain, for example, the recall for rare diseases, we have to divide the total number of rare diseases proposed by a model by the total number of rare diseases present in the test dataset. A rare disease mention is correctly identified only if all its tokens have been correctly classified with its corresponding BIO tags. Precision tells us how precise is the model. It can be defined as the ratio between the correctly predicted mentions for a given entity type (or token type) and the total number of predicted mentions by the model for this entity type (or token type). Finally, F1 is the harmonic average of precision and recall, which is a useful metric for unbalanced datasets [62]: As our task is a multi-classification problem, we also calculate micro and macro average scores. In macro averaging, metrics are calculated independently for each entity type (or for each token type), and then, we calculate the unweighted mean of these metrics. For example, the macro-average precision will be the unweighted mean of all precision scores for the entity types (or for the token types). We also compute the weighted macroaverages, in which each entity type (or token type) is weighted by the relative number of its instances in the dataset. In micro averaging, true positives, false positives, and false negatives are computed jointly for all entity types (or for all token types), and then, the metrics are calculated. In macro averaging, all classes are treated equally, while, in micro averaging, the classes with more instances will have more impact on the final performance. As our problem has a large class imbalance (see Table 1), we use micro-average scores to provide an overall comparison of all the approaches proposed in this paper. We start by presenting the micro-average F1 of the approaches (Table 2). We can clearly see that the BERT-based models outperform all the other models. Clearly, the deep contextualized vectors from the BERT-based models provide a better representation for the input texts than those provided by CRF or the pre-trained word embeddings used in BiLSTM. BioBERT obtains better results than BERT and ClinicalBERT. This may happen because this was trained on biomedical scientific articles, whose narrative is similar to that used in the NORD database for describing rare diseases. Regarding the other approaches, although BiLSTM was extended with a CRF layer as the output layer, this architecture does not obtain better results than a simple CRF. A possible reason could be that this deep learning technique requires a larger number of training examples for learning. We now present the results of each approach below. Table 3 shows results achieved by CRF at entity level. CRF achieves a micro-average F1 of 64.8% and a macro-average F1 of 61.9%. As entity types are unbalanced (see Table 1), we also consider the macro-weighted-average F1, which is of 63.8%.  The best results are obtained for rare disease entity type (F1=82.4%), which is the second entity type with the largest number of instances, 5228, in the corpus (see Table 1.. On the contrary, sign entity type shows the lowest F1 (45.5%) value, even though it is the entity type with the largest number of instances, 5230 (see Table 1). Both entity types, rare diseases and signs, have a very close number of instances. This may happen because the sign mentions are usually nominal phrases (for example, "malformations of the nipples", see Fig. 1c), unlike disease, rare disease or symptom names, which are usually a combination of few technical terms (for example, "chronic arthritis" or "ADCY5-related dyskinesia", see Fig. 1). Token-level results are shown in Table 4. As expected, these results coincide with the results for entity-level. Its "Support" column shows the number of instances for each type of token. The number of internal tokens (I-) for diseases or rare diseases is slightly higher than the number of its initial tokens (B-), while the number of internal tokens for signs doubles the number of its initial tokens. In addition, many sign mentions are discontinuous entities, that is, they present gaps in their description. The sentence shown in Fig. 1c contains two signs: "malformations of the nipples" and "malformations of the abdominal wall", being the last one a discontinuous mention. Another possible reason is that many signs can be also considered as diseases (see Fig. 1a). CRF and the other models proposed in this study only provide a label per token. That is, they do not address the task of overlapped entities (see Fig. 1b1). The low performance for signs can be explained by all these reasons.

CRF (baseline)
Both signs and symptoms are clinical manifestations of diseases. A sign is an objective evidence, while a symptom is a subjective experience that can only be identified by the patient. However, contrary to the low results for signs, CRF provides the second-best F1 for symptom type (F1=62.2%), even though its number of instances, 397, is the lowest in the corpus. (see Table 1). A manual review of symptoms and signs mentions in the training dataset shows that most symptoms are described by technical terms (for example, "headache"), while signs usually have lay descriptions (for example, "dark circles under eyes"). It would be necessary to increase the number of symptoms in the RareDis corpus to study whether the difference between the results of both types of entities is maintained.

BiLSTM
All the BiLSTM models (see Table 5 provide significantly lower results than CRF (see Table 3). The decrease in micro-average F1 is more than 20% and 24% in macro-average F1. This may indicate that the training data is too small for using deep learning. As happened with CRF, BiLSTM obtains the best results for rare diseases and worst ones for signs. The results at token-level (see Table 6) are coherent with the results at entity level.
Regarding the effect of pre-trained word embeddings to initialize the network, the BiLSTM with Wiki-Pubmed-PMC provides the best overall results. It also obtains the best results for rare diseases and diseases. This may be because these word embeddings were trained on biomedical texts. BiLSTM with Glove achieves a slightly better F1 for signs than BiLTM with Wiki-Pubmed-PMC. However, BiLSTM with Glove achieves an improvement of almost 6% of F1 for symptoms over BiLSTM with Wiki-Pubmed-PMC.  Although Glove word embeddings were not trained on biomedical texts, they obtain very close results to those obtained with Wiki-Pubmed-PMC. This may be because Glove has the biggest vocabulary size. On the other hand, random initialization shows the worst results. In fact, the model trained with random word vectors was not able to detect any symptom. Table 7 shows the results obtained by the BiLSTM-CRF. In all the BiLSTM-CRF models, the CRF layer helps outperform the same models without using CRF, with improvements around 10-15% over the BiLSTM overall scores. All BiLSTM-CRF models achieve higher average recall scores than CRF, while their average precision scores are negatively  affected. Thus, BiLSTM-CRF models still provide lower overall results than the baseline based on CRF, with a decrease of 6% in micro-average F1.

BiLSTM-CRF
Regarding the pre-trained word embeddings, Wiki-Pubmed-PMC and Glove word embeddings provide better performance than using random initialization or GoogleNews word embeddings. BiLSTM-CRF with Glove provides the best results for rare diseases and signs, while Wiki-Pubmed-PMC provides the best F1 for diseases. Entity-level and token-level (see Table 8) results show the same behavior. The model trained with Wiki-Pubmed-PMC or Glove word embeddings achieve the best F1 scores for all token types, except for the B-Symptom and I-Symptom. For these tokens, the best F1 scores are provided by the model trained with random initialization. However, due to the lowest number of instances of this entity type, it is very difficult to give an explanation. It would be necessary to increase its number of instances to know the real behavior of the model for this entity type.
As mentioned previously, BiLSTM fails to beat the baseline, not even when it includes a CRF classifier as its last layer. This may be because the training data size is not enough to train a deep learning model, while a CRF classifier trained with a simple feature set can deal with the task.

BERT-based models
We have explored the use of three different deep contextualized word representations, all of them based on BERT (see Table 9). Unlike the BiLSTM models, these BERT-based models exceed the baseline results provided by a simple CRF classifier.
BioBERT achieves the best micro-average and macro-weighted average F1, while the best macro-average F1 is provided by ClinicalBERT. In general, BioBERT and Clinical-BERT show very close results. As happened with the previous models, rare diseases show the best results, followed by diseases. BioBERT obtains the best F1 for rare diseases and for signs, while ClinicalBERT BERT provides the best results for diseases and symptoms. As expected, the BERT base model, which was trained on BookCorpus and English Wikipedia, obtains lower results than BioBERT and ClinicalBERT.
Regarding the results at the token-level (see Table 10), BioBERT achieves the best F1 scores for all token types, except for B-Symptom and I-Symptom tokens. In these tokens, the best model is ClinicalBERT. Comparing to the previous approaches, all BERT-based models achieve significant improvements on recall. For example, BioBERT largely outperforms CRF, with an increase of 17 points in recall for diseases (see Table 3). The BERT-base also shows significant improvement on recall for rare diseases and signs, with differences of 5 and 14 points, respectively, comparing to the previous best model BiLSTM-CRF. Similarly, ClinicalBert has an improvement of 10 points for recall over recall provided by BiLSTM-CRF model. This significant improvement on recall compared to the previous method may be due to Given text, WordPiece first pre-tokenizes the text into words (by splitting on punctuation and whitespaces) and then tokenizes each word into subword units, called wordpieces

Discussion
Although rare diseases have a very low prevalence in the population, approximately 6% of the world's population suffer a rare disease. This number is continually growing as five new rare diseases are discovered each week [63].
In this paper, we study several methods for recognizing rare diseases and their clinical manifestations. We propose a CRF baseline system using linguistic features. Second, we implement multiple BiLSTMs, testing different classifiers at the output layer such as softmax or CRF, as well as exploring different strategies to initialize their input vectors, such as random initialization and three pre-trained word embedding models, one of them was trained on biomedical texts. Moreover, we explore three implementations of BERT, which differ between them by the type of texts used to pre-train the model. The RareDis corpus is used to train the models and evaluate them. The experiments show that BioBERT obtains the best micro and macro-weighted-average F1, with improvements around 5% over the baseline (CRF) results. BiLSTM does not even outperform the baseline in terms of F1.
Regarding the entity types, the best model, BioBERT, provides the highest F1 (85.2%) for rare diseases, followed by diseases with a F1 of 60.7%. Rare disease names are usually more complex than disease names since many rare disease names often contain disease names (for example, "central diabetes insipidus"). Therefore, the difference between these results may be that the number of rare diseases mentions is twice the number of diseases mentions in the dataset (see Table 1). The other entity types, sign and symptom, do not outperform 60% in F1. One possible reason for the low results to recognize symptoms (F1=58%) may be the insufficient number of training instances for this entity type (see Table 1). On the other hand, although sign is the majority class (see Table 1), this shows the lowest F1 (57.2%). In this case, the most probable reason is that many signs are usually described by complex noun phrases that often involve the use of overlapped, nested and discontinuous entities.
As mentioned before, the BERT-base models provide significant improvements on recall scores for all the entity types and all the token types, compared to the previous approaches. This large improvement may be explained by the wordpiece tokenization used in BERT, while the previous approaches used word-based tokenizers to process the input texts. The wordpiece tokenizer first splits the text into tokens, and then the rare words are broken into smaller meaningful words (named wordpieces) [64]. For example, as "chronic" is a very common word, it is not split into smaller subwords. In contrast, "arthritis" can be considered a rare word because it is less common than "chronic". For this reason, the tokenizer breaks it into three wordpieces: "art", "##hr", "##itis". Wordpiece tokenization has multiples advantages. First, it provides a representation for unknown words by splitting them into known smaller tokens. Moreover, the model is able to learn meaningful representations with a reasonable vocabulary size. These advantages have been also mentioned in several previous works [65][66][67].
Most previous work has focused on recognizing disease names. So far, a model based on transformers [18] has achieved the state-of-the-art results with an F1 of 89.92% on the NCBI corpus, and 93.82% on the CDR corpus. Our results are not directly comparable to previous work, because this is the first work that addresses the detection of rare diseases and their clinical manifestations from the RareDis corpus. Even so, we can see that our model based on BioBERT provides an F1 of 85.25% for the recognizing of rare diseases, which is close to the state-of-the-art performance on the NCBI corpus (F1=89.92%)). We should note that the NCBI corpus contains 6982 disease mentions, while our RareDis corpus only has 5228 rare disease mentions (3608 instances in the training subset). That is, it is reasonable to think that if the RareDis corpus had a similar number of instances to the NCBI corpus, our model could have achieved close results to the state-of-the-art results on the NCBI corpus [18]. Similarly, the size of the CDR corpus is also greater than the RareDis corpus. Moreover, until now, only one study addressed the task of recognizing rare diseases and their disabilities from texts, by using a BiLSTM with a CRF layer. This approach achieved an F1 of 70.1% for rare diseases on the RDD corpus [40]. Therefore, our approach using BioBERT obtains a better performance.
As future work, we plan to extend the size of the RareDis corpus by including Med-Line abstracts and clinical cases of rare diseases. This will increase the number of instances for all entity types, especially the number of instances for symptoms. It could have a significant positive effect on the results, especially those achieved by the deep learning models. We also plan to extend the corpus with texts written in other languages than English. Thus, we will study the behavior of our models on other type of texts.
We will also address some unsolved problems in NER such as the recognition of nested, overlapped and discontinuous entities, which could improve the results for signs. Regarding the models, we will study on fine-tuning the BERT-based models by adding different techniques, such as a simple CRF or more complex architectures. Furthermore, we plan to address the task of relation extraction on the RareDis corpus.