Multiple-level biomedical event trigger recognition with transfer learning

Background Automatic extraction of biomedical events from the literature is an important task in understanding biological systems, allowing the latest discoveries to be incorporated faster. Detecting the trigger words that indicate events is a critical step in event extraction, because the following steps depend on the recognized triggers. The task in this study is to identify event triggers from the literature across multiple levels of biological organization. To achieve high performance, machine learning based approaches such as neural networks must be trained on a dataset with plentiful annotations. However, annotations may be difficult to obtain across the multiple levels, and annotated resources have so far mainly focused on relations and processes at the molecular level. In this work, we apply transfer learning to multiple-level trigger recognition, in which a source dataset with sufficient annotations at the molecular level is utilized to improve performance on a target domain with insufficient annotations and more trigger types. Results We propose a generalized cross-domain neural network transfer learning architecture and approach that can share as much knowledge as possible between the source and target domains, especially when their label sets overlap. In the experiments, the MLEE corpus is used as the target dataset to train and test the proposed model on multiple-level trigger recognition. Two corpora from the BioNLP'09 and BioNLP'11 Shared Tasks, with varying degrees of label overlap with MLEE, are used as source datasets, respectively. Regardless of the degree of overlap, our proposed approach achieves recognition improvement. Moreover, its performance exceeds previously reported results of other leading systems on the same MLEE corpus. 
Conclusions The proposed transfer learning method further improves performance compared with the traditional method when the labels of the source and target datasets overlap. The most essential reason is that our approach changes the way parameters are shared: vertical sharing replaces horizontal sharing, which yields more sharable parameters. These additional parameters shared between the networks effectively improve the performance and generalization of the model on the target domain.

involved in some phosphorylation processes, the "regulation" and "phosphorylation" events come into being. The event extraction task usually contains two main steps: identifying the event triggers and then identifying the event arguments according to the triggers [6]. Event trigger recognition, which aims at detecting the expressions in text that indicate certain events, is the first and crucial step of event extraction. Event extraction performance depends heavily on the recognized triggers, as clearly shown by Björne et al. [7], who found that performance declined by more than 20 points when predicted triggers were used instead of the gold standard. Many Machine Learning (ML) based methods, including Conditional Random Field (CRF) [8, 9], Support Vector Machine (SVM) [7, 10-13], and Deep Neural Network (DNN) [14-16] models, have been successfully applied to event trigger recognition.
These machine learning based approaches rely on large quantities of high-quality annotated training data, and their performance may deteriorate when training instances are insufficient. However, acquiring manually annotated datasets is both time consuming and costly. Up to now, manual annotations of biological events have mainly focused on genes and proteins. In the corpora of the BioNLP'09 Shared Task, 9 types of frequently occurring biomolecular events are annotated. Biomolecular events involving proteins and genes are an important part of the picture of biological systems, but still only a small part. Hence, in order to obtain a more comprehensive understanding of biological systems, the scope of event extraction has been broadened from molecular-level reactions to cellular-, tissue- and organ-level effects, and to organism-level outcomes [17]. It is not trivial to keep the annotations up to date with the expanding event types across multiple levels. For example, in the MLEE corpus [10] multiple levels of events, from the molecular level to the whole organism, have been annotated, and the number of event types has been extended to 19. At the same time, however, the number of annotated instances per event type has been greatly reduced. Thus, it would be useful if an annotated dataset from a related domain (such as the biomolecular event annotations of the BioNLP'09 corpus) could help alleviate the shortage of training data in the target domain (such as multiple-level event recognition on the MLEE corpus). Recently, transfer learning (TL) techniques have been proposed to address this need [18].
The concept of transfer learning comes from the observation that when learning in a new related domain, humans can usually benefit from what they have learned before [19]. This idea has been employed in the data mining and machine learning fields [20-22] as a transfer learning schema. Pan and Yang [18] define transfer learning as using knowledge learned from a source dataset to perform a task on a target dataset. Transfer learning has been successfully applied to many fields, including text mining [23, 24].
Here, we focus on transfer learning for DNNs, due to their successful application to many text mining tasks over the last few years. Ideally, transfer learning can achieve higher performance by reducing the amount of annotated data needed and improving the generalization of the model on the target dataset. In the setting of text mining (TM) and Natural Language Processing (NLP), transfer learning approaches for DNN models fall into three common categories, according to the difference between the source and target datasets: cross-lingual transfer, cross-domain transfer and cross-task transfer. Because the languages differ, cross-lingual transfer is mostly limited to using additional language resources [25, 26] to transfer knowledge between the source and target datasets, and it does not extend to our application of biomedical event trigger recognition across multiple levels.
Sharing the same language, both cross-domain and cross-task transfer learning can take advantage of more relevance between the source and target datasets. In these two modes, the parameters of DNN models are used to transfer knowledge between the datasets: some parameters of a model learned from a source dataset can be used to initialize the corresponding parameters of a related model, which is then optimized on a target dataset. How many parameters can be shared usually depends on the degree of relevance between the source and target datasets. Yang et al. [27] examined the effects of transfer learning for deep hierarchical recurrent networks on several different sequence labelling tasks, covering cross-domain, cross-task and cross-lingual transfer learning models, and reported that significant improvements can be obtained. In the case of cross-domain transfer, the two domains are consistent when their label sets are identical or mappable to each other; otherwise, they are inconsistent. If the two domains are consistent, the source and target DNN models can share the parameters of all layers; if they are inconsistent, parameter sharing is restricted to the lower layers of the models. Cross-task transfer can simply be considered a case of cross-domain transfer with inconsistent label sets, since different tasks do not share the same tags; hence, the same parameter sharing strategy is effective for both [27]. In the work of Meftah et al. [28], both cross-task and cross-domain (with inconsistent source and target tags) transfer learning were implemented to address the need for annotated social media text, and the validity and genericity of the models were demonstrated on Part-Of-Speech (POS) tagging tasks. More studies on transfer learning have been successfully performed on NLP sequence labelling tasks. 
Dong et al. [29] proposed a multichannel DNN model to transfer knowledge across domains in Chinese social media; to ensure the consistency of the source and target domains, some tags were merged in their paper. The experiments showed that the model achieved state-of-the-art performance. Lee et al. [24] used cross-domain transfer learning for Named Entity Recognition (NER) with consistent tags, showing that transfer learning improved upon the state-of-the-art results on a target dataset with a small number of instances. Giorgi et al. [30] demonstrated that transferring a DNN model significantly improved the leading results for biomedical NER when the source and target domains are consistent.
Our aim in this study is to transfer trigger recognition knowledge from the source molecular-level domain to the target multiple-level domain. This can be seen as an exploratory step towards the more effective automatic extraction of targets in a complex and multifarious domain based on an available simple and singular domain. This situation often occurs when research is extended from a familiar area to an unfamiliar and broader one. For instance, after the 9 types of molecular-level event relationships between genes and proteins in the biomedical literature have been studied, the research focus shifts to other levels and the event types are expanded. The source and target domains, event triggers from different levels, are highly related, and under this circumstance their label sets may overlap to a greater or lesser extent. Nevertheless, the annotations of the source and target domains are inconsistent, because their label sets are neither identical nor mappable. However, among all the above transfer learning studies, no model is designed to address how to share network parameters when the label sets overlap; they simply reduce the problem to the case of entirely different label sets between the source and target domains.
We present a new generalized transfer learning approach based on a DNN model, which attempts to share knowledge to the extent possible between the related source and target domains. The transfer learning approach is modified and generalized to share more network parameters and thereby improve trigger recognition performance across multiple levels on the target domain. Our approach mainly addresses transfer learning between domains with overlapping label sets. In this paper, a source domain with plentiful annotations of biomolecular event triggers (the BioNLP corpus) is used to improve performance on a target domain of multiple-level event triggers with fewer available annotations (the MLEE corpus). To our knowledge, no reported research has applied transfer learning that makes the best use of overlapping label sets to find shared knowledge.
The rest of this paper is organized as follows. "Methods" section provides detailed descriptions of the proposed generalized transfer learning method and the Multiple-Level Trigger recogNizer (MLTrigNer) system. "Results" section describes the biomedical corpora used, the experimental settings, and all the experimental results. This is followed by an in-depth analysis in "Discussion" section. We present conclusions and future work in "Conclusions" section.

Corpus description
An in-depth investigation is carried out to evaluate the performance of our proposed Multiple-Level event Trigger recogNizer, MLTrigNer, which is built on the generalized cross-domain transfer learning BiLSTM-CRF model. The dataset Data MLEE is used as the target domain dataset. Data ST09 and Data EPI11 , which have varying degrees of label overlap with it, are used as the source domain datasets, respectively. The named entity and trigger types annotated in these corpora are shown in Table 1.
Among the trigger types of Data MLEE , the labels overlapping with Data ST09 are marked with '*', and the labels overlapping with Data EPI11 are marked with '+'. We can see that Data MLEE and Data ST09 are highly related, with nine overlapping trigger labels. However, some of these overlapping labels go beyond the molecular level in Data MLEE , which annotates events across multiple levels. For example, "Localization" is an event type extracted for both cells and biomolecules in Data MLEE . Data MLEE and Data EPI11 are loosely related, with only two overlapping trigger labels. More details of these datasets are introduced in the following.

Data MLEE
The MLEE corpus [10] is used as the target dataset to train and test our MLTrigNer on multiple-level trigger word identification. The corpus is drawn from 262 PubMed abstracts focusing on tissue-level and organ-level processes, which are highly related to certain organism-level pathologies. In Data MLEE , 19 event types are chosen from the GENIA ontology and can be classified into four groups: anatomical, molecular, general and planned. Our task is to identify the correct trigger type of each event; hence, there are 20 tags in the target label set, including a negative one. The statistics of the training, development and test sets are shown in Table 2.

Data ST09
This corpus is taken from the Shared Task (ST) of the BioNLP challenge 2009 [4] and contains training and development sets comprising 950 abstracts from PubMed. It is used to train our MLTrigNer as a source dataset. In this corpus, 9 event types are chosen from the GENIA ontology involving molecular-level entities and processes, which can be categorized into 3 groups: simple events, binding events and regulation events. The training and development sets are combined as the source domain dataset Data ST09 . The detailed statistics of Data ST09 are shown in Table 3.

Data EPI11
This corpus is taken from the Epigenetics and Post-translational Modifications (EPI) task of the BioNLP challenge 2011 [5] and contains training and development sets comprising 800 abstracts, drawn from PubMed, relating primarily to protein modifications. It is also used to train our MLTrigNer as a source dataset. In this corpus, 14 protein entity modification event types and their catalysis are chosen, giving 15 event types in total. The training and development sets are combined as the source domain dataset Data EPI11 . The detailed statistics of Data EPI11 are shown in Table 4. Data EPI11 contains fewer annotated events than Data ST09 while annotating more event types.

Performance assessment
We measure the performance of the trigger recognition system in terms of the F1 measure, which combines precision and recall. Precision is the ratio of the number of correctly classified triggers within a category to the total number of recognized ones. Recall is the ratio of the number of correctly classified triggers within a category to the total number of triggers. They are defined as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)

where TP is the number of triggers correctly classified to a category, FP is the number of triggers misclassified to that category, and FN is the number of triggers of that category misclassified to other categories.
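Under these definitions, the per-category metrics can be computed as in the following short sketch; the counts are illustrative, not taken from our results:

```python
# Per-category precision, recall and F1 from TP/FP/FN counts.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example with invented counts for one trigger category.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.8 0.8 0.8
```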

Implementation details
All of the experiments described in the following are implemented using the TensorFlow library [31]. Hyperparameters are tuned on the training and development sets through cross-validation and then used to train the final model. The embedding layer is followed by a BiLSTM layer with a hidden state dimension of 300 and a fully-connected layer with 600 dimensions. To avoid overfitting, dropout with a probability of 0.5 is applied before the inputs to the BiLSTM and fully-connected layers.

Transfer learning performance
The effectiveness of our proposed approach is illustrated by comparing the performance of the three neural network models described in "Methods" section. First, Basic Model A (Fig. 1) is trained only on the training and development sets of Data MLEE (without transfer learning) as a baseline measurement, and its results are shown in the second column of Table 5. Then, Data ST09 is used as the source dataset in the transfer learning models. The TL Model C (Fig. 2) and the MLTrigNer Model (Fig. 3) are jointly trained on Data ST09 and the training and development sets of the target dataset Data MLEE using their respective transfer learning approaches. All three models are tested on the test set of Data MLEE , and the results are shown in the third and fourth columns of Table 5. Among the models described in "Methods" section, TL Model B (Fig. 4) cannot be used in the trigger recognition task because domain-dependent input feature sets are employed, which are inconsistent between the source and target domains. Comparing the results of Basic Model A and TL Model C, we can see that transfer learning improves the F1 measure by 1.76%. Generalizing the transfer learning schema in the MLTrigNer Model improves trigger recognition performance by a further 1.78%. This improvement is due to the fact that our approach transfers more parameters from the source network to the target one than usual, signifying more effective knowledge sharing. It is worth noting that both precision and recall improve, which reflects the ability of MLTrigNer to identify more positive triggers. Higher precision and recall signify the identification of more potential biomedical events during the subsequent processing phase, which is important for the ultimate event extraction application. Compared with TL Model C, the F1 values of all trigger types overlapping with the source dataset are improved, except "Negative regulation" and "Localization". 
Among these overlapping labels, some go beyond the molecular level in Data MLEE to annotate events across multiple levels. Moreover, the F1 values of the 7 non-overlapping trigger types are also improved, except for "Growth", "Dephosphorylation" and "Planned process". Hence, our proposed approach can improve recognition performance across multiple levels by transferring more knowledge from a single-level domain.
Then, Data EPI11 is used as the source dataset alternatively. Basic Model A (Fig. 1) is again trained only on the training and development sets of Data MLEE (without transfer learning) as a baseline measurement, and its results are shown in the second column of Table 6. The TL Model C (Fig. 2) and the MLTrigNer Model (Fig. 3) are then jointly trained on the source dataset Data EPI11 and the training and development sets of the target dataset Data MLEE using their respective transfer learning approaches, with the results shown in the third and fourth columns of Table 6. All three models are tested on the test set of Data MLEE . Comparing the results of Basic Model A and TL Model C, we can see that transfer learning improves the F1 measure by 0.87%. The MLTrigNer Model improves the performance by a further 1.04%, again in both precision and recall. Using Data EPI11 as the source dataset, the MLTrigNer Model brings less performance improvement. This is due to the decreased correlation between the source and target domains: fewer parameters can be transferred from the source to the target network.
However, our MLTrigNer Model can still improve the performance further compared with the basic transfer learning approach. Hence, our proposed method is effective whether the label overlap is large or small. Compared with TL Model C, the recognition performance of the overlapping trigger "Phosphorylation" is not improved: its F1 measure is 100.0 in both models and cannot be improved further. Moreover, the performance of the 13 non-overlapping trigger types is improved in all cases.

MLTrigNer compared with other trigger recognition systems
We compare the performance of the proposed transfer learning based trigger recognition system, MLTrigNer, with other leading systems on the same Data MLEE dataset. Since Data ST09 shows the better performance as a source dataset in Tables 5 and 6, we use Data ST09 to train the MLTrigNer Model. The detailed F1 measure results are presented in Table 7. Pyysalo et al. [10] defined an SVM-based classifier with rich hand-crafted features to recognize triggers in text. Zhou et al. [13] also defined an SVM-based classifier, with word embeddings and hand-crafted features. Nie et al. [14] proposed a word embedding-assisted neural network model to capture semantic and syntactic information in event trigger identification (the results were converted to 19 categories). Wang et al. [15] defined a window-based convolutional neural network (CNN) classifier. Rahul et al. [16] proposed a method that uses a recurrent neural network (RNN) to extract higher-level sentence features for trigger identification.
From Table 7, we can draw two conclusions. First, our generalized transfer learning approach achieves the best result on the dataset Data MLEE , which indicates that our MLTrigNer can further improve the performance of biomedical trigger word recognition. Second, from Table 5, TL Model C achieves competitive results compared to these leading systems, which means that the improvement of our generalized transfer learning approach is achieved on a relatively strong basis.

Transfer performance analysis on highly related domains
We conduct an in-depth study and detailed comparison on the highly related domains of Data ST09 and Data MLEE to show the learning ability of our proposed approach. In our study, two datasets with different degrees of label overlap are used as source domains to transfer knowledge. Between them, Data ST09 is highly related to the target domain: from Table 1, its trigger types are nested within those of the target domain dataset. Hence, we can simply put Data ST09 and the training and development sets of Data MLEE together to train the BiLSTM-CRF model without transfer learning (Basic Model A), and then test the model on the test set of Data MLEE . Its performance is shown in Table 8. From the results we can see that the performance even declines when nested datasets are simply mixed together. On the other hand, the performance can be improved using our transfer learning approach. In the process of trigger recognition, the shared knowledge brought by transfer learning is more important than the data itself.

Ratio effect analysis on source data
It is important to analyze the effect of the ratio of source domain data. First, we use Data ST09 as the source dataset, which is more than 3.6 times the size of the target domain dataset. We keep the size of the target data unchanged and gradually change the size of the source data. The resulting changes in the MLTrigNer Model are shown as a curve in Fig. 5, with the source ratio at 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% and 100%. We can see that F1 first goes up continuously as source data is added, reaching a maximum of 81.31 when the source ratio is 80%. It then trends downwards even as more source data is added, reaching 80.46 with 100% of Data ST09 . The results verify that more data from the source domain does not always lead to better performance in the target domain. In our study, the optimal source/target ratio is about 2.9 : 1 when maximum performance is achieved on Data MLEE . To optimize the performance of the model under different datasets, we therefore set the ratio of source domain data as one of the important hyperparameters of the MLTrigNer Model, tuned on the training and development sets using cross-validation.
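The ratio-tuning procedure described above can be sketched as follows; the dataset contents, sizes and the train-and-evaluate step are placeholders, not our actual pipeline:

```python
import random

# Treat the source-data ratio as a hyperparameter: subsample the source
# set at a given ratio and combine it with the full target set before
# joint training. The datasets here are dummy lists for illustration.
def mix_datasets(source, target, source_ratio, seed=0):
    rng = random.Random(seed)
    k = int(len(source) * source_ratio)
    return rng.sample(source, k) + list(target)

source = [("src", i) for i in range(360)]  # ~3.6x the target size
target = [("tgt", i) for i in range(100)]

for ratio in (0.2, 0.5, 0.8, 1.0):
    mixed = mix_datasets(source, target, ratio)
    # A train-and-evaluate call would go here; the ratio with the best
    # cross-validated dev F1 is kept as the tuned hyperparameter.
    print(ratio, len(mixed))
```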
Then, we use Data EPI11 as the source dataset alternatively, which is about 3.1 times the size of the target domain dataset. We also keep the size of the target data unchanged and gradually change the size of the source data. The resulting changes in the MLTrigNer Model are shown as a curve in Fig. 6, with the source ratio at 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% and 100%. Similar trends are found in Figs. 5 and 6. The F1 measure first goes up continuously as source training data is added, reaching a maximum of 79.68 when the source ratio is 90%. It then trends downwards even as more source data is added, reaching 79.45 with 100% of Data EPI11 . After tuning on the training and development sets using cross-validation, the optimal source/target ratio is about 2.7 : 1 when maximum performance is achieved on Data MLEE .

Error analysis
From the metrics in Tables 5 and 6 we can notice that the results for the trigger type "Dephosphorylation" are all zeroes regardless of the model. From the more detailed list of types and sizes of trigger words of Data MLEE in Table 9, we can see that there are only 6 "Dephosphorylation" instances in Data MLEE . Without adequate training instances, the recognition results of Basic Model A and TL Model C are very poor. Moreover, even with our transfer learning approach, the recognition results of the MLTrigNer Model remain zero, despite "Dephosphorylation" being an overlapping trigger type. This is a limitation of our transfer learning approach: it cannot transfer enough knowledge from other triggers to label rare trigger types.

Fig. 4 The network architecture of TL Model B: the transfer learning BiLSTM-CRF model with different label sets, having Embedding, BiLSTM, Fully-connected and CRF layers for the source and target networks, respectively. The parameters can be transferred in the Embedding and BiLSTM layers

Conclusions
In this paper we develop a novel transfer learning approach for multiple-level event trigger recognition based on a DNN model. We design a more general transfer learning approach for the cross-domain setting, which can share as much knowledge as possible between the source and target datasets, particularly in the case of overlapping label sets. In the experiments, source datasets with varying degrees of label overlap with the target dataset are utilized to verify the effectiveness of our proposed MLTrigNer model. Compared with the basic transfer learning model, our approach further improves the performance on the target domain. Moreover, its performance exceeds other leading trigger recognition systems on the same MLEE corpus. Hence this study contributes to the effective recognition of biomedical trigger words from text across multiple levels. Through analysis, we find three essential factors that matter to our cross-domain transfer learning approach: the degree of overlap of the source and target domains; the number of sharable parameters in each layer of a network; and an appropriate size of the source and target datasets. In future work, more source datasets from different biomedical event levels with varying degrees of overlapping label tags can be utilized together to improve the performance further.

Methods
In this section, we introduce our proposed transfer learning approach. Our solution for trigger recognition is based on a Bidirectional LSTM-CRF (BiLSTM-CRF) model [32], which uses a deep neural network, the Long Short-Term Memory (LSTM) [33], to extract higher-level abstract features to train a CRF [34]. We design a transfer learning approach that allows joint training with a source dataset whose input feature set and output label set overlap with those of the target dataset. We first introduce the architecture of the BiLSTM-CRF model as Basic Model A. We then introduce the cross-domain transfer learning BiLSTM-CRF model with inconsistent label sets as TL Model B, and additionally with inconsistent input feature sets as TL Model C. Finally, our proposed generalized transfer learning model, Generalized TL Model D, is described in detail. The architectures of the four models are shown in Figs. 1, 4, 2 and 3, respectively.

Basic Model A: BiLSTM-CRF model
We present our trigger recognition task based on the BiLSTM-CRF model as Basic Model A, whose architecture is shown in Fig. 1. In Basic Model A, θs denote all the trainable parameters in each network layer. This model detects trigger words and annotates their types, and its performance serves as the baseline. For a given input sentence {word 1 , word 2 , ..., word n }, the aim of trigger recognition is to output a tag sequence {tag 1 , tag 2 , ..., tag n }, where word i is a word (or token) in the sentence and tag i denotes its corresponding type label. The value of tag i belongs to the label set, which consists of the biomedical event types plus a negative tag for words that do not indicate any event. The BiLSTM-CRF model feeds a set of features into an input embedding layer (with parameters θ Emb ), extracts higher-level abstract features in the subsequent BiLSTM (with parameters θ LSTM ) and fully-connected (with parameters θ F ) layers, and trains a CRF layer for the final sequence labelling. The main layers of the BiLSTM-CRF model for trigger recognition are described below.

Fig. 5 The ratio effect of source domain data Data ST09 on our transfer learning model, MLTrigNer, with the ratio as 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% and 100%
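As a toy illustration of this sequence labelling input/output format; the sentence and tag names below are invented for the example, not drawn from the MLEE corpus:

```python
# One sentence in, one tag per token out; non-trigger tokens get the
# negative tag (here called "Negative" for illustration).
sentence = ["NF-kB", "activation", "requires", "STAT3", "phosphorylation"]
tags = ["Negative", "Positive_regulation", "Negative", "Negative",
        "Phosphorylation"]
assert len(sentence) == len(tags)

# The recognized triggers are the tokens with a non-negative tag.
triggers = [(w, t) for w, t in zip(sentence, tags) if t != "Negative"]
print(triggers)
```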

Embedding layer
In order to express both syntactic and semantic information in input sentences, besides each word word i , we also extract four other features from the characters, POS, named entity type and dependency parse tree. Through lookup tables, the embedding layer converts each input feature into one of the following representation vectors: 1 Word embedding vector E w : Each word in an input sentence is mapped to a word embedding vector, which contains semantic information from its linear contexts. In this paper, we use a pre-trained word lookup table LT w learned from PubMed articles using the word2vec model [35]. 2 Character embedding vector E c : We use an extra LSTM network to extract the orthographic information from the sequence of characters in each input word. Its parameters LT c are the weights and biases of the LSTM, which are initialized randomly and trained to output a character-level embedding vector. 3 POS embedding vector E p : We train a POS lookup table LT p to extend the word embedding. It maps the POS tag of each word in an input sentence to a POS embedding vector. 4 Dependency-based word embedding vector E d : In order to extend features from linear word contexts to non-linear syntactic contexts, each word from an input sentence is mapped to a dependency tree-based word embedding vector, which contains rich non-linear functional and syntactic information. We use a pre-trained word lookup table LT d learned from English Wikipedia using the skip-gram model [36].

Fig. 6 The ratio effect of source domain data Data EPI11 on our transfer learning model, MLTrigNer, with the ratio as 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% and 100%
In the embedding layer, the trainable parameter set can be expressed as θ Emb = {LT c , LT p , LT d }.
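The per-word output of the embedding layer can be sketched as a concatenation of the four vectors above; the dimensions here are illustrative assumptions, not our tuned values:

```python
import numpy as np

# Sketch: look up the word, character, POS and dependency-based vectors
# for one token and concatenate them into the BiLSTM input. The random
# vectors stand in for actual lookup-table rows.
dim_w, dim_c, dim_p, dim_d = 200, 25, 25, 300  # assumed dimensions
rng = np.random.default_rng(0)
E_w = rng.normal(size=dim_w)  # pre-trained word2vec vector (LT_w)
E_c = rng.normal(size=dim_c)  # char-LSTM output (LT_c)
E_p = rng.normal(size=dim_p)  # POS lookup (LT_p)
E_d = rng.normal(size=dim_d)  # dependency-based vector (LT_d)

x = np.concatenate([E_w, E_c, E_p, E_d])  # input to the BiLSTM layer
print(x.shape)  # (550,)
```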

BiLSTM layer
This layer takes a concatenation of the output embedding vectors of the previous embedding layer as input. Due to its ability to learn long-distance dependencies in a sequence through designed memory cells, the LSTM is a powerful tool for sequence labelling tasks [33]. Suppose that an input sequence to an LSTM layer is {x 1 , x 2 , ..., x T }; it yields an output sequence {h 1 , h 2 , ..., h T } of the same length T by employing the following updates during training [32]:

i t = σ(W xi x t + W hi h t-1 + W ci c t-1 + b i )
f t = σ(W xf x t + W hf h t-1 + W cf c t-1 + b f )
c t = f t ⊙ c t-1 + i t ⊙ tanh(W xc x t + W hc h t-1 + b c )
o t = σ(W xo x t + W ho h t-1 + W co c t + b o )
h t = o t ⊙ tanh(c t )

where σ denotes the logistic sigmoid function, tanh is the hyperbolic tangent activation function, ⊙ is element-wise multiplication, and all weights (W s) and biases (b s) make up the parameter set (θ LSTM ) of the LSTM layer. More details about the LSTM can be found in [32]. In sequence labelling tasks, it is better to process both the past (left-side) and future (right-side) context dependencies in the sequence. Therefore, a commonly used variant of the LSTM is employed, called the Bidirectional LSTM (BiLSTM) [32, 37]. In the BiLSTM, for each word the forward LSTM captures features from the left side and the backward LSTM captures features from the right side, so each word effectively encodes information about the whole sentence.
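A minimal single-step LSTM cell can be sketched in NumPy as follows; peephole terms are omitted for brevity, and all sizes are illustrative:

```python
import numpy as np

# One LSTM time step: gate pre-activations from [x_t, h_{t-1}], then the
# input/forget/output gates and the cell/hidden state updates.
def lstm_step(x, h_prev, c_prev, W, b):
    z = W @ np.concatenate([x, h_prev]) + b
    d = h_prev.size
    i, f, g, o = z[:d], z[d:2*d], z[2*d:3*d], z[3*d:]
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c = f * c_prev + i * np.tanh(g)   # new cell state
    h = o * np.tanh(c)                # new hidden state
    return h, c

rng = np.random.default_rng(0)
x_dim, h_dim = 8, 4
W = rng.normal(size=(4 * h_dim, x_dim + h_dim)) * 0.1
b = np.zeros(4 * h_dim)
h, c = lstm_step(rng.normal(size=x_dim), np.zeros(h_dim), np.zeros(h_dim), W, b)
print(h.shape)  # (4,)
```

A BiLSTM simply runs one such cell forward over the sequence and another backward, concatenating the two hidden states per token.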

Fully-Connected layer
The output of the BiLSTM layer at each time step t, obtained by concatenating the outputs of the forward and backward LSTMs, is mapped to a linear and fully-connected network layer with ReLU activation functions as follows:

y_t = ReLU(W h_t + b)

where all weights (Ws) and biases (bs) make up the parameter set θ_F of the fully-connected layer.
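This step can be sketched as follows, with hypothetical dimensions and random stand-ins for the BiLSTM hidden states:

```python
import numpy as np

rng = np.random.default_rng(2)
d_lstm, d_out, T = 3, 4, 5

# Hypothetical forward and backward hidden states from the BiLSTM.
h_fwd = rng.normal(size=(T, d_lstm))
h_bwd = rng.normal(size=(T, d_lstm))
h = np.concatenate([h_fwd, h_bwd], axis=1)  # one 2*d_lstm vector per step

# Fully-connected layer with ReLU activation: y_t = max(0, W h_t + b).
W = rng.normal(size=(d_out, 2 * d_lstm))
b = np.zeros(d_out)
y = np.maximum(0.0, h @ W.T + b)
print(y.shape, bool(np.all(y >= 0.0)))
```

The resulting per-token score vectors are what the CRF layer below consumes.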

CRF layer
On top of the fully-connected layer, a final CRF layer generates the sequence of labels for the corresponding words. The CRF layer can learn the strong dependencies across output labels and arrive at the most likely sequence of predicted tags [38].
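At decoding time, a linear-chain CRF combines the per-token scores from the fully-connected layer with a learned tag-transition matrix and finds the highest-scoring tag sequence with the Viterbi algorithm. A minimal sketch with toy scores and two tags (the numbers are hypothetical):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence under a linear-chain CRF."""
    T, K = emissions.shape
    score = emissions[0].copy()          # best score ending in each tag
    back = np.zeros((T, K), dtype=int)   # backpointers
    for t in range(1, T):
        # score of moving from tag i (row) to tag j (col) at step t
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = np.argmax(total, axis=0)
        score = np.max(total, axis=0)
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example: 3 time steps, 2 tags; transitions discourage staying in tag 1.
emissions = np.array([[2.0, 0.0], [0.0, 1.0], [0.0, 1.5]])
transitions = np.array([[0.0, 0.0], [0.0, -2.0]])
print(viterbi_decode(emissions, transitions))  # → [0, 0, 1]
```

In training, the transition matrix is learned jointly with the network so that implausible label sequences receive low scores.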

Transfer learning approach
The goal of cross-domain transfer in this study is to learn a sequence labelling model for triggers that transfers knowledge from a source domain to a related target domain.

TL model b
When the source and target domains share the same input feature set but have different output label sets, the parameters of the embedding and BiLSTM layers can be transferred, while the parameters of the fully-connected layer closest to the output cannot be shared and are trained for each domain separately. The TL Model B in Fig. 2 gives an overview of how to transfer the parameters between the neural network layers of both datasets.

TL model c
When the source and target domains each come with their own domain-dependent features, such as named entity type, the input feature sets of the two domains are inconsistent. The BiLSTM layers will then have different parameter dimensions and structures due to the different feature sets; hence, the parameters of this layer cannot be shared either. In this situation, the only parameters that can be transferred are those of the embedding layer, as shown in Eq. 12. More specifically, the shared parameters are the lookup tables trained for domain-independent features, θ_{s,shared} = {LT_w, LT_c, LT_p, LT_d}, where LT_w and LT_d are pre-trained. The TL Model C in Fig. 2 gives an overview of how to transfer the parameters between the neural network layers of both datasets.

Generalized TL model d (MLTrigNer): our transfer learning approach
This study uses a corpus with biomolecular trigger annotations as the source domain dataset and a corpus with multiple-level biomedical event triggers as the target domain dataset. Because their input feature and output label sets are inconsistent, we can only choose the TL Model C shown in Fig. 2 to build a trigger recognizer, without sharing the parameters of the fully-connected and BiLSTM layers. This ignores the information hidden in the overlapping features and labels. It is known in transfer learning that the more parameters are shared, the better generalization can be achieved in the target domain. For this purpose, we propose a generalized transfer learning architecture and approach to share as many parameters as possible, exploring the transferability of each layer in a neural network, especially when the feature and label sets overlap. As discussed above, parameters stand for the abstract features learned by a neural network. In the basic transfer learning architectures, TL Models B and C, the parameters are chosen to be transferred horizontally, according to the network layers. When the label sets of the source and target domains are consistent, parameters from the upper (fully-connected) and middle (BiLSTM) layers can be transferred. Otherwise, when the label sets are inconsistent, the parameters of the whole upper layer closest to the output are discarded in TL Model B. Moreover, when the source and target domains have inconsistent extracted feature sets, the parameters of the whole middle layer must also be discarded in TL Model C. After careful study of the lower (embedding) layer of TL Model C, we find that all the parameters learned from the source domain can be split into two parts: a source-specific part and a source-target-shared part. Correspondingly, the parameters of the target domain can also be split into two parts: a target-specific part and a source-target-shared part.
This kind of division is vertical within a network layer, and the source-target-shared part of the parameters can transfer the information carried by the overlapping feature and label sets in the middle and upper layers. The main benefit is that we can include more domain-dependent features in the lower layer. For instance, in our trigger recognition task, there is a different and richer named entity type feature set in the target domain. Figure 3 shows how we generalize the basic transfer learning approach to share as many parameters as possible. As mentioned, the parameters of each layer l are split into two parts, domain-specific and domain-shared parameters:

θ^l_s = θ^l_{s,specific} + θ^l_{s,shared}, θ^l_t = θ^l_{t,specific} + θ^l_{t,shared} (13)

where θ^l_{s,shared} and θ^l_{t,shared} are the parameters shared and mapped through the transfer learning in each layer l, and the domain-specific parameters θ^l_{s,specific} and θ^l_{t,specific} are trained for each domain exclusively.
The proportion of parameters to be transferred from the source network to the target network is determined by the degree of overlap of the input feature and output label sets between the source and target domains. Figure 3 shows the parameter sharing situation of the MLTrigNer. In general, suppose {x^l_1, x^l_2, ..., x^l_j, ...} are the inputs of each layer l, {y^l_1, y^l_2, ..., y^l_j, ...} are the outputs, and the parameters θ of this layer are all its weights (W^l) and biases (b^l). Since the parameters can be divided into domain-shared and domain-specific parts, their connected inputs and outputs can also be divided accordingly.
For the middle layers, such as the BiLSTM layers, of the source and target networks in Fig. 3, the inputs of feature embedding vectors are divided into domain-specific and shared parts, [x^l_{specific}, x^l_{shared}]. Hence the corresponding domain-specific and shared connection weights for each output y^l_j are [W^l_{j,specific}, W^l_{j,shared}], and each output y^l_j has its own bias b^l_j. The shared parameters in Eq. 13, θ^l_{s,shared} and θ^l_{t,shared}, are {W^l_{shared}, b^l}. We can obtain each output y^l_j as follows:

y^l_j = active_function((W^l_{j,specific})^T x^l_{specific} + (W^l_{j,shared})^T x^l_{shared} + b^l_j) (14)

For the upper layers, such as the fully-connected layers, of the source and target networks in Fig. 3, the label outputs are divided into domain-specific and shared parts, [y^l_{specific}, y^l_{shared}]. Hence the domain-specific and shared parameters for the corresponding outputs are {W^l_{j,specific}, b^l_{j,specific}} and {W^l_{j,shared}, b^l_{j,shared}}, respectively. The shared parameters in Eq. 13, θ^l_{s,shared} and θ^l_{t,shared}, are {W^l_{shared}, b^l_{shared}}. We can obtain each domain-specific output y^l_{j,specific} and shared output y^l_{j,shared} as follows:

y^l_{j,specific} = active_function((W^l_{j,specific})^T x + b^l_{j,specific}) (15)

y^l_{j,shared} = active_function((W^l_{j,shared})^T x + b^l_{j,shared}) (16)

If the feature sets are exactly the same on both domains, there are no source-specific or target-specific parts of the parameters for the BiLSTM layers, θ^LSTM_{s,specific} = θ^LSTM_{t,specific} = ∅. Moreover, under this circumstance, if the label sets are completely different from each other on both domains, there are no source-target-shared parameters for the fully-connected layer, θ^F_{s,shared} = θ^F_{t,shared} = ∅, which is the TL Model B. On the other hand, if both the label sets and the feature sets are inconsistent, we have θ^LSTM_{s,shared} = θ^LSTM_{t,shared} = ∅ and θ^F_{s,shared} = θ^F_{t,shared} = ∅, which is the TL Model C.
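The middle-layer computation described above, where an output combines domain-specific and shared weight slices, can be sketched as follows (all dimensions are hypothetical, and tanh stands in for the layer's activation function):

```python
import numpy as np

rng = np.random.default_rng(3)
d_shared, d_specific, d_out = 4, 2, 3

# Shared and domain-specific slices of the layer's input vector.
x_shared = rng.normal(size=d_shared)
x_specific = rng.normal(size=d_specific)

# y_j = f(W_specific^T x_specific + W_shared^T x_shared + b_j);
# W_shared (with the biases) is the part transferred between domains,
# while W_specific is trained for each domain exclusively.
W_shared = rng.normal(size=(d_out, d_shared))
W_specific = rng.normal(size=(d_out, d_specific))
b = np.zeros(d_out)
y = np.tanh(W_specific @ x_specific + W_shared @ x_shared + b)
print(y.shape)
```

The upper-layer case is symmetric: there the output units, rather than the input dimensions, are partitioned into shared and domain-specific groups.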
The training takes place over the following three main phases. First, the network is trained on the dataset from the source domain. Both θ l s,specific and θ l s,shared are learned. Then the shared parameters of each layer are transferred to the target domain, θ l s,shared → θ l t,shared , to initialize the corresponding parts of the target model parameters. Finally, the network is trained on the dataset from the target domain. Both θ l t,specific and θ l t,shared are tuned and optimized.
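The three training phases can be sketched as follows. This is a toy illustration of the parameter bookkeeping only: the layer names and shapes are hypothetical, and random initialization and a dummy update stand in for actual training.

```python
import numpy as np

rng = np.random.default_rng(4)

def init_params():
    """Toy parameter sets: each layer split into shared and specific parts."""
    return {layer: {"shared": rng.normal(size=(2, 2)),
                    "specific": rng.normal(size=(2, 2))}
            for layer in ("embedding", "bilstm", "fc")}

# Phase 1: train on the source dataset (training itself is elided here;
# init_params() stands in for a trained source network).
source = init_params()

# Phase 2: transfer the shared part of every layer to initialize the
# corresponding part of the target network.
target = init_params()
for layer in target:
    target[layer]["shared"] = source[layer]["shared"].copy()
transferred_ok = all(np.array_equal(target[l]["shared"], source[l]["shared"])
                     for l in target)

# Phase 3: fine-tune both shared and specific parts on the target dataset
# (a dummy update step, for illustration only).
for layer in target:
    for part in target[layer]:
        target[layer][part] -= 0.01 * rng.normal(size=(2, 2))
```

After phase 3, the shared parameters may drift away from their source values, since they are tuned jointly with the target-specific parameters.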