Biomedical event extraction with a novel combination strategy based on hybrid deep neural networks

Background Biomedical event extraction is a fundamental and in-demand technology that has attracted substantial interest from many researchers. Previous works have relied heavily on manually designed features and external NLP packages, for which the feature engineering is extensive and complex. Additionally, most existing works use a pipeline process that breaks a task down into simple sub-tasks but ignores the interaction between them. To overcome these limitations, we propose a novel event combination strategy based on hybrid deep neural networks to solve the task in a joint end-to-end manner. Results We applied our method to several annotated corpora of biomedical event extraction tasks. Our method achieved state-of-the-art performance, with a noticeable improvement in overall F1 score over existing methods on all of these corpora. Conclusions The experimental results demonstrated that our method is effective for biomedical event extraction. The combination strategy can reconstruct complex events from the output of deep neural networks, while the deep neural networks effectively capture the feature representation from the raw text. The biomedical event extraction implementation is available online at http://www.predictor.xin/event_extraction.


Background
PubMed recorded over 28 million papers in 2018 [1], which reflects the rapid growth of the biomedical literature. The knowledge and discoveries reported in the biomedical literature receive substantial attention, but the large volume of the literature poses a challenge to information retrieval; therefore, text mining has become an in-demand technology and a popular research focus. Event extraction, which is an effective way to represent the structured knowledge from unstructured text [2], is a fundamental technology for text mining. However, event extraction is particularly difficult due to the complex and arbitrary structure of events in biomedicine, so related research is urgently needed [3].
According to BioNLP [4], a biomedical event consists of (1) a trigger word that indicates the existence of an event and belongs to a certain event type and (2) multiple arguments, where an argument can be viewed as a relation between the event trigger and an entity or another event, and each argument has an argument type as well. Therefore, the task of event extraction is to recognize the event triggers with their arguments from the raw text.
We illustrate biomedical events with the example in Fig. 1. The word "promote" is an event trigger of the event type Positive Regulation. This event has a Theme argument linked to the word "tumorigenesis", which is an entity of the Carcinogenesis type, and a Cause argument linked to "over-expression". Notice that some events can be arguments of other events, i.e., a nested structure: "over-expression" serves as a Gene Expression event trigger as well as an argument of the Positive Regulation event. Therefore, an event can be viewed as a directed graph in the text, in which the nodes are event triggers or entities and the directed edges indicate the arguments.
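The directed-graph view of the Fig. 1 example can be sketched as a small data structure. This is only an illustration: the class names `Entity` and `Event` and the helper `edges` are hypothetical and are not part of the described system.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Entity:
    text: str
    type: str


@dataclass
class Event:
    trigger: str   # trigger word in the sentence
    type: str      # event type of the trigger
    # Each argument is (argument type, target). The target may be an Entity
    # or another Event, which is what produces the nested structure.
    arguments: List[Tuple[str, object]] = field(default_factory=list)


def edges(event):
    """Yield (source trigger, argument type, target text), recursing into nested events."""
    for role, target in event.arguments:
        target_text = target.trigger if isinstance(target, Event) else target.text
        yield (event.trigger, role, target_text)
        if isinstance(target, Event):
            yield from edges(target)


tumorigenesis = Entity("tumorigenesis", "Carcinogenesis")
# The arguments of the nested event are not stated in the text, so none are listed here.
over_expression = Event("over-expression", "Gene Expression")
promote = Event("promote", "Positive Regulation",
                [("Theme", tumorigenesis), ("Cause", over_expression)])

for edge in edges(promote):
    print(edge)
```

Storing a nested event directly as the target of an argument keeps the graph explicit without a separate edge table.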
Since biomedical event extraction was defined as a standard task, various methods have been proposed. Most previous work can be classified into three types: rule-based approaches [5,6], traditional shallow machine learning models and deep learning models. The Turku Event Extraction System (TEES) [7,8] is a biomedical event extraction system that uses rich features from dependency parsing. The TEES utilizes a step-wise approach based on multi-class SVMs by breaking down the whole task into straightforward consecutive graph node and edge classification tasks. The EventMine [9] is a similar SVM-based pipeline method with handcrafted features. Majumder et al. [10] exploited a stacking model for biomedical event extraction. The system uses SVC, SGD and LR as the base-level classifiers and takes SVC as the meta-level classifier. A transition-based model for event extraction [11] is another approach, which leverages a structured perceptron for encoding and decoding with a beam search to find the global best prediction.
In recent years, deep learning methods have been applied to this task; they extend the feature representation from text and improve the performance. Wang et al. [12] proposed a convolutional neural network (CNN) with multiple distributed features for biomedical event extraction. The distributed features contain not only the word embedding but also trigger types, POS labels and topic representation. Li et al. [13] utilized dependency-based word embedding and a parallel multi-pooling convolutional neural network to extract biomedical events. This approach preserves more information by pooling the multiple segments of a sentence divided by trigger words and arguments. Björne and Salakoski [14] integrated CNN into the original TEES to supply more features and replaced the SVM classifier with dense layers, which suggested that the inclusion of the neural network significantly enhanced the performance. Li et al. [15] proposed a framework that uses gated recurrent unit networks with an attention mechanism to extract biotope and bacteria events.
However, deep learning methods are still rarely used for biomedical event extraction, which is partially due to the complexity of the task-specific event structures. More deep models have focused on the sub-tasks of event extraction such as event trigger detection [16][17][18] and relation classification [19][20][21][22], and most of these models obtained superior performance compared to traditional shallow methods.
Despite the success of existing methods in biomedical event extraction, they generally suffer from two limitations. First, most of them heavily rely on manually designed features and usually need complicated natural language processing (NLP) from external NLP toolkits with poor generalizability. Second, these methods organize the task in a pipeline manner and separate it into independent sub-tasks, which simplifies the problem but ignores the interaction between the sub-tasks and makes the process prone to accumulating errors.
Due to the aforementioned limitations, we propose a biomedical event extraction method with a novel combination strategy based on deep neural networks. Our method detects the candidate event triggers and relations from raw text with recurrent neural networks (RNN) and convolutional networks (CNN), and then the Combination Strategy (CS) constructs the event from the detected results by solving an optimization problem. The proposed method takes advantage of neural networks that can represent features from word embedding in semantic space [23] and removes the reliance on feature engineering. The CS, which integrates global information to optimize a penalty, alleviates the error accumulation.
We evaluated our method with three common biomedical event extraction tasks: the Multi-Level Event Extraction (MLEE) [24], Cancer Genetics (CG) and Pathway Curation (PC) from BioNLP Shared Task 2013 (BioNLP-ST2013) [4]. Our method outperforms the state-of-the-art methods for all of these tasks according to overall F1 scores. The experimental results demonstrate the effectiveness and generalizability of the hybrid networks and the CS. Additionally, our method needs only a minimal task-specific configuration and no adjustment of the method itself, which makes it easy to apply to various biomedical event extraction tasks.
The contributions of this paper are summarized below:
• Describes the first attempt to use hybrid deep neural networks (CNN and RNN) for biomedical event extraction.
• Proposes a novel combination strategy to integrate the detected triggers and relations in an optimized manner.
• Utilizes end-to-end learning and avoids reliance on manual feature design and external NLP packages.

Dataset
We trained and evaluated our method using three common annotated datasets: CG, PC and MLEE. Each dataset is initially divided into three parts: training, development and test sets; the statistics of these datasets are listed in Table 1. The CG data concerns the extraction of events relevant to cancer, including molecular foundations, cellular tissue, and organ-level effects. The PC data targets reactions relevant to the development of biomolecular pathway models. The MLEE data focuses on events across multiple levels of biological organization from the molecular level to the organ system level. All of these datasets provide entity labels for each word so that the task can focus on event extraction. The pre-processing is simple: we only split each document into sentences and tokenize them into sequences of words, which does not rely on any additional NLP toolkits.

Training
The main hyper-parameters, which were tuned on the development set, were set as follows: learning rate = 0.007, ratio of class weights of positive to negative classes for TR and RC = 5:1, weight decay = 0.0002, batch size = 16, α = 0.5, β = 0.25, γ = 0.125, threshold_t = −2.0 and threshold_r = −2.0. The other hyper-parameters of our model are listed in Additional file 1: Table S11. We set k = 10 in the inverse sigmoid decay function. The activation function was leaky ReLU and the optimizer was Adam. We used 5 single models for ensemble learning. Some additional techniques were employed, including a Char-level CNN and word embedding pretrained on a large external corpus [25], Xavier initialization for neural layers [26], dropout in the LSTM and undersampling in EE. The undersampling keeps the positive and negative event samples balanced in each sentence.
The final model was trained on the union of the training and development sets for 100 epochs for each task, and the loss curve on the CG corpus is shown in Fig. 2. The loss of each module was calculated individually, and their gradients were propagated simultaneously to update the parameters at the end of every batch. The loss of TR declines first, followed by the loss of RC, while the loss of EE changes most slowly. This phenomenon is reasonable because the latter two losses rely on the results detected by the former modules. The black curve is the total loss, which sums the losses from TR, RC and EE. The oscillation of the loss curve is due to the variance in sentence length and the different numbers of events among the batches. The figure shows that the model has converged after 100 training epochs. The 100-epoch training on the CG corpus (400 documents) takes about 12 h on an i7-7700 CPU. The prediction of a single model for each document takes about 20 s on average, and the shares of the time consumed by each module (TR, RC, EE and CS) are 55.97%, 20.75%, 23.06% and 0.23%, respectively.

Performance
We used standard recall, precision and F1 scores as evaluation metrics. An event was regarded as a true positive only if both its trigger and its arguments were detected correctly. Our evaluation followed the primary criteria, i.e., approximate span matching and approximate recursive matching [27]. Table 2 shows the overall performance of our method and other state-of-the-art methods for CG, PC and MLEE on the test sets. RelAgent [6] is a linguistically motivated rule-based system to extract biomedical events. NCBI [5] uses an approximate sub-graph matching-based approach. Zhou and Zhong [28] utilized a semi-supervised learning framework with un-annotated corpora. The TEES [7,8] and EventMine [9] are both SVM-based pipeline models with hand-designed rich features. The TEES CNN [14] is the upgraded version of TEES coupled with CNN and uses a mixed 5-model ensemble with randomized train/development set splits. Wang et al. [12] and Li et al. [13] both developed convolutional network-based methods. Table 2 shows that our method achieved the highest F1 scores for all three datasets, with 58.04% for CG, 55.73% for PC and 60.05% for MLEE, which suggests the effectiveness and generalization ability of the hybrid networks and CS. We conducted Student's t-test [29] on the best existing F1 score and the F1 scores of our proposed method over multiple runs. The results indicate that the improvements are statistically significant on the CG and MLEE tasks with p-value < $10^{-3}$, while the p-value on the PC task is 0.062 (the detailed statistics are listed in Additional file 1: Table S7-S9). The precision was considerably higher than the recall on all datasets, which was probably due to the highly diverse event schemes and the insufficient training set. Table 3 shows the detailed performance of our method and other existing methods for grouped event categories of CG. As shown in Table 3, our method outperformed other methods for all the event categories.
The most significant improvement was for Modification events, and these events relied highly on the global contextual information that was modelled precisely by the recurrent network of EE. Through vertical comparison, the lower scores for Regulation, Planned Pro and Modification compared to those of other categories were due to their nested structure, i.e., these events usually took other events as arguments, which were more difficult to correctly detect. The full detailed performance for CG, PC and MLEE is listed in Additional file 1: Table S4-S6.
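For reference, the three evaluation metrics can be computed from event counts as in the following sketch. The function name and the counts in the example are hypothetical and are not taken from the tables; the matching criteria that decide what counts as a true positive are not modelled here.

```python
def precision_recall_f1(true_positive, predicted, gold):
    """Return (precision, recall, F1) from event counts.

    true_positive: predicted events that match a gold event
    predicted:     total number of predicted events
    gold:          total number of gold-standard events
    """
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / gold if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, chosen only to illustrate precision exceeding recall.
p, r, f1 = precision_recall_f1(true_positive=50, predicted=80, gold=100)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```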

Alternatives comparison
Before the proposed method was finalized, we conducted ablation experiments on several variations of the proposed method to validate the effectiveness of each part of the method. Table 4 shows the results of the ablation study, which evaluated the ensemble learning, the EE module, the CS algorithm, the threshold setting and the Char-CNN module. In these experiments, methods were trained on the training set and tested on the development set. The Single-model is the proposed method as a single model without ensemble learning. The Single-model-pipeline separates TR, RC and EE into independent networks without parameter sharing. The Combination-rule-single does not use the EE module and replaces the CS with a rule-based method that assigns only one event to each detected event trigger. The Combination-rule-all similarly replaces the CS with another rule-based method that generates all possible combinations from triggers and relations. The EE-probability directly uses the probabilities output by TR, RC and EE to determine the final events instead of using CS, i.e., the method extracts events simply from the classification results of each network module (setting all thresholds to 0). The Zero-threshold resets both threshold_t and threshold_r to zero and keeps the other settings the same as in the proposed method. The Without-CharCNN removes the Char-CNN module from the method. The Single-model-pipeline and Without-CharCNN require retraining the neural networks, while the other variations do not, since they only change the settings in post-processing steps.
As shown in Table 4, the proposed method achieved higher F1 scores than the alternative variations. The comparison with Single-model shows that the ensemble learning noticeably improved the F1 score by 3%-5%, mainly through an improvement in precision of 6%-8%. Since the randomness in training led to variance among different runs (which also caused performance differences of several percentage points, e.g., the F1 scores over multiple runs of Single-model on CG have a standard deviation of 0.532%), the ensemble could reduce the variance and obtain higher precision. The F1 scores of Single-model-pipeline declined slightly by 0.2%-1.6%, which suggests that parameter sharing is effective and efficient (the training of Single-model-pipeline takes longer than that of the proposed joint model). The performance of Combination-rule-single decreased markedly by 3%-6% compared to the proposed method, especially in recall, because the method assigned only one event to each detected trigger, and multiple events associated with one trigger were discarded by its rule (such multiple events account for 59.7% of the total events in the CG corpus, for example). The Combination-rule-all obtained high recall, even higher than that of our proposed method on the CG and PC tasks, but it suffered from much lower precision, and its F1 scores decreased markedly because it constructed too many incorrect events. Both of these ablation tests show that the EE module is valuable. EE-probability, which replaces the CS, yielded a 0.8%-2.6% decline in F1 score, which indicates that the CS contributes a positive effect by optimizing the penalty value. However, EE-probability obtained higher precision because the method applied stricter extraction. EE-probability assigned an event only when all of the TR, RC and EE modules returned positive classification results; thus it detected fewer events than the proposed method did (e.g., 2225 vs. 2354 on the CG development set).
Additionally, the comparison with the Zero-threshold demonstrates that setting a lower threshold for the candidate triggers and relations can construct more valid events and promote overall performance. The contrast results in the last row demonstrate that the Char-CNN module is beneficial to the overall performance by providing the lexical features.

Error and limitation analysis
We divided extraction errors into five types: wrong trigger label, wrong trigger span, wrong arguments, redundant arguments and other errors. Detailed statistics for the errors in the prediction phase are listed in Table 5. The statistics suggest that the most common error type was the wrong trigger span, which constituted about half of the total errors, indicating that the range of trigger words is the most difficult information to detect. Moreover, similar to other works, our method ignores two special cases for simplicity. First, a few events had trigger words and arguments spanning more than one sentence, but our method only detects events within a single sentence. Second, a few words were associated with more than one event trigger label, but our method could assign only one trigger label to them. Ignoring such cases could cause a performance reduction, but these cases are relatively rare (approximately 2%-4.5%) and had limited effect (see Additional file 1: Table S10).
Table 5 notes: * The statistics are derived by training the method on the training set and testing on the development set of CG/PC/MLEE. * Wrong T_Label represents event triggers that were assigned the wrong label. Wrong T_Span represents trigger words whose range was wrong (including detected triggers that do not exist in the gold standard). Wrong Argu indicates that the event trigger was correctly detected but the arguments were wrongly assigned. Similarly, Redundant Argu indicates that redundant arguments were assigned to correctly detected triggers.
Some limitations still exist and need further improvement. Our method is based on a deep neural network with a large number of learnable parameters, but a training set of only hundreds of documents is somewhat insufficient. The tuning of hyper-parameters relied on a grid search on the development set, which was time consuming.

Conclusions
In this paper, we present deep neural networks coupled with a combination strategy to extract biomedical events. Our method detects the event triggers and classifies the relations jointly while taking advantage of deep neural networks that extract feature representations automatically and do not rely on manual feature engineering. The novel Combination Strategy integrates the outputs from different stages to construct the events by solving an optimization problem, which alleviates the error accumulation. The evaluation results show that our method has achieved state-of-the-art performance compared to existing methods, which indicates that the Combination Strategy and the deep neural networks in our method are effective. In the future, we plan to extend our method with semi-supervised learning to address the insufficiency of the training corpora. Since biomedical text mining is a desirable technology for converting the large number of articles into structured information with high-level semantics, we believe the proposed method has the potential to facilitate event extraction in broad, real-world scenarios for researchers.

Methods
We apply end-to-end supervised deep learning for event extraction. The overall architecture of the networks is illustrated in Fig. 3, which consists of 5 modules: the Character-level CNN (CharCNN), Bi-directional LSTM (BiLSTM), Trigger Recognition (TR), Relation Classification (RC) and Event Evaluation (EE). The CharCNN and BiLSTM encode a sentence into a sequence of feature vectors. The TR, RC and EE are stacked on BiLSTM and determine the type and probability of each event trigger, relation and candidate event, respectively. These modules are trained simultaneously in a joint manner, which can benefit from parameter sharing [19,21]. Finally, the outputs of these modules are integrated into the CS, which is a post-processing step that is applied in prediction phase to generate the final events.
We cast the argument assignment as a relation classification task, so we use the term relations instead of arguments in the rest of this paper.

Character-level CNN
Character-level CNN (CharCNN) extracts the character-level features of each word. The module is inspired by previous work [21], which has been shown to be effective due to its ability to capture morphological information [30].
For each word, the module first looks up an embedding layer to get the vector representation of each character. Let the sequence of embedding vectors be $V^{(c)} = \{v^{(c)}_1, v^{(c)}_2, ..., v^{(c)}_n\}$, where $v^{(c)}_i \in \mathbb{R}^d$ is the vector of the i-th character. Then, the sequence is fed to a convolution layer, which is computed by:
$$y^{(c)} = f(\mathrm{conv}(V^{(c)}, W_1) + b_1) \quad (1)$$
where k denotes the kernel size, conv(·, ·) is the convolutional operator, f(·) is the activation function, $W_1 \in \mathbb{R}^{d \times k \times nc}$ is the parameter of the convolution layer, where nc is the number of output channels, and $b_1 \in \mathbb{R}^{nc}$ is the bias vector. Therefore, we obtain the representation matrix $y^{(c)} \in \mathbb{R}^{n \times nc}$ for an n-length word. To obtain a fixed-length representation of the word, adaptive max pooling is then applied to the output:
$$ch = \mathrm{adaptive\_max\_pool}(y^{(c)}) \quad (2)$$
where $ch \in \mathbb{R}^{nc}$ is the char-level representation of the word.
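The convolution followed by max pooling over the character axis can be illustrated with a dependency-free sketch. It assumes zero padding so that the output keeps one row per character, and it uses a leaky ReLU activation; the function names, the padding scheme and the toy weights are assumptions made for this illustration, not details confirmed by the paper.

```python
def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def char_cnn(char_vectors, W, b, k):
    """Convolve over characters, then max-pool each channel.

    char_vectors: n character vectors of size d
    W:            nested list indexed W[j][t][c], shape d x k x nc
    b:            bias with nc entries
    k:            kernel size
    """
    n, d, nc = len(char_vectors), len(char_vectors[0]), len(b)
    left = (k - 1) // 2
    zero = [0.0] * d
    # Zero-pad so that the convolution output has shape n x nc (assumed scheme).
    padded = [zero] * left + list(char_vectors) + [zero] * (k - 1 - left)
    y = []
    for i in range(n):
        row = []
        for c in range(nc):
            total = b[c]
            for t in range(k):
                for j in range(d):
                    total += padded[i + t][j] * W[j][t][c]
            row.append(leaky_relu(total))
        y.append(row)
    # Max pooling over the character axis gives a fixed-length vector of size nc.
    return [max(y[i][c] for i in range(n)) for c in range(nc)]


# Toy word with 3 characters, d = 2, k = 3, nc = 2.
chars = [[1.0, -1.0], [2.0, 0.5], [3.0, 2.0]]
# Channel 0 sums the first component over the window;
# channel 1 copies the second component of the centre character.
W = [[[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
     [[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]]
b = [0.0, 0.0]
ch = char_cnn(chars, W, b, k=3)
print(ch)
```

Because the pooling takes a maximum per channel, the length of `ch` depends only on the number of channels and not on the word length.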

Bi-directional LSTM
The Bi-directional LSTM (BiLSTM) encodes a sentence into a list of hidden vectors. The LSTM can model long-distance dependencies thanks to its memory and forget blocks, and signals from two directions help the module sense the context [31]. Given a sentence with n words, the word embedding layer maps each word into a vector $w_i$. Similarly, the entity label of each word is mapped to a vector $e_i$ by the entity label embedding layer. The character-level representation of each word, denoted as $ch_i$, is obtained from CharCNN. Then, the above vectors are concatenated and denoted as $v_i = [w_i, e_i, ch_i]$. The vector representations of the n words form $V = \{v_1, v_2, ..., v_n\}$, which is fed to the LSTM layers with two parallel (forward and backward) directions. The computation of the LSTM layer at time step i is:
$$i_i = \sigma(W^{(i)} v_i + U^{(i)} h_{i-1} + b^{(i)})$$
$$f_i = \sigma(W^{(f)} v_i + U^{(f)} h_{i-1} + b^{(f)})$$
$$g_i = \tanh(W^{(g)} v_i + U^{(g)} h_{i-1} + b^{(g)})$$
$$o_i = \sigma(W^{(o)} v_i + U^{(o)} h_{i-1} + b^{(o)})$$
$$c_i = f_i \odot c_{i-1} + i_i \odot g_i$$
$$h_i = o_i \odot \tanh(c_i)$$
where $i_i$, $f_i$, $g_i$, $o_i$ and $c_i$ are the input gate, forget gate, intermediate state, output gate and cell state, respectively, $W^{(\cdot)}$, $U^{(\cdot)}$ and $b^{(\cdot)}$ are the weight matrices and bias vectors of each gate, $\sigma(\cdot)$ is the sigmoid function and $\odot$ denotes element-wise multiplication. $h_i = [\overrightarrow{h_i}, \overleftarrow{h_i}]$ is the hidden vector from the LSTM at time step i, which consists of two directions. Finally, we obtain the sentence encoding sequence $H = \{h_1, h_2, ..., h_n\} \in \mathbb{R}^{2hd \times n}$, where hd is the hidden size. Additionally, we also obtain the sequence of entity label embeddings $E = \{e_1, e_2, ..., e_n\}$.

Trigger recognition
We cast the Trigger Recognition (TR) as a sequence labelling task. The TR module receives the output of BiLSTM and assigns an event label to each word in the sequence in the BILOU scheme [32].
Given the input sequence $H = \{h_1, h_2, ..., h_n\}$, we assign the labels in a greedy manner from left to right. At time step i, we concatenate the encoded vector $h_i$ and the previous event trigger label vector $t_{i-1}$ into $x_i = [h_i, t_{i-1}]$ and then send it into two linear layers and a log softmax layer, which is written as:
$$y^{(t)}_i = W_3 f(W_2 x_i + b_2) + b_3$$
$$p^{(t)}_i = \mathrm{log\_softmax}(y^{(t)}_i)$$
where $W_2$ and $W_3$ are the weight matrices of the two linear layers, respectively, $b_2$ and $b_3$ are the bias vectors, and f(·) is the activation function. The softmax layer transforms $y^{(t)}_i$ into the trigger label probability vector $p^{(t)}_i$. The event trigger label of the i-th word is assigned as the m-th trigger type, where m is the index of the maximal element in $p^{(t)}_i$ other than none, if $p^{(t)}_{i,m} - p^{(t)}_{i,none} > threshold_t$; otherwise we assign none to the word. The trigger label is then transformed to $t_i$ by the event trigger label embedding layer and sent to the next time step. Finally, we obtain the sequence of trigger label embedding vectors $T = \{t_1, t_2, ..., t_n\}$.
Here we define a support value to measure the confidence of the assigned label.

Definition 1 A Support Value is the probability difference between the assigned label m and the label none.
A larger support value means higher confidence in the recognized label.
For each recognized event trigger with label index m, a support value is computed by:
$$s^{(t)}_i = p^{(t)}_{i,m} - p^{(t)}_{i,none}$$
where none is the index of the none type.
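The thresholded label assignment and the support value can be sketched as follows. The sketch assumes that the difference is taken between log-softmax outputs, which is consistent with a threshold of −2.0 (a difference of plain probabilities could never fall below −1). The function names, label set and scores are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log softmax over a list of raw scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def assign_label(scores, labels, threshold, none_label="none"):
    """Return (assigned label, support value) for one word."""
    log_p = log_softmax(scores)
    none_idx = labels.index(none_label)
    # Index of the best label other than none.
    m = max((i for i in range(len(labels)) if i != none_idx),
            key=lambda i: log_p[i])
    support = log_p[m] - log_p[none_idx]
    if support > threshold:
        return labels[m], support
    return none_label, support


labels = ["none", "Gene Expression", "Positive Regulation"]
scores = [2.0, 1.0, -1.0]   # hypothetical raw scores for one word

# With a negative threshold the word is kept as a candidate trigger
# even though none has the highest probability.
print(assign_label(scores, labels, threshold=-2.0))
# With a zero threshold the same word is labelled none.
print(assign_label(scores, labels, threshold=0.0))
```

A candidate kept this way carries a negative support value, which later stages can weigh against the evidence from relations and events.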
In the training of TR, we use a scheduled sampling trick to eliminate the gap between training and inference [33]. We take the inverse sigmoid decay function $\epsilon_i = k/(k + \exp(i/k))$ to decide the probability of using the true token or the inferred token, where i is the number of training epochs.
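A minimal sketch of the inverse sigmoid decay schedule is given below, assuming that the decayed value is the probability of feeding the true previous label (so the model relies increasingly on its own predictions as training proceeds). The function names are hypothetical.

```python
import math
import random


def inverse_sigmoid_decay(epoch, k=10.0):
    """Probability of using the true token at the given epoch."""
    return k / (k + math.exp(epoch / k))


def choose_previous_label(gold_label, predicted_label, epoch, k=10.0, rng=random):
    """Pick the label vector source for the next time step during training."""
    if rng.random() < inverse_sigmoid_decay(epoch, k):
        return gold_label        # teacher forcing
    return predicted_label       # the model's own previous prediction


for epoch in (0, 25, 50, 100):
    print(epoch, round(inverse_sigmoid_decay(epoch), 4))
```

With k = 10, the schedule starts at 10/11 and is close to zero by epoch 100, so late training closely resembles inference.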

Relation classification
The Relation Classification (RC) module predicts the relation type for each candidate pair. The sentence representation is derived from BiLSTM, and the event triggers have already been detected by TR. Here, we use the information from the events/entities and the sub-sentence between them to predict the type of relation.
Since the TR module has detected the event triggers and the entities are given, we combine all possible trigger-entity and trigger-trigger pairs. The set of candidate pairs is written as $RL = \{(e^{(1)}, e^{(2)}) \mid e^{(1)} \in event\_trigger\_set, e^{(2)} \in event\_trigger\_set \cup entity\_set\}$. To predict the relation type of $(e^{(1)}, e^{(2)}) \in RL$, we first take the truncated sequence of BiLSTM hidden vectors $H_{start_1:end_1}$, the sequence of trigger label vectors $T_{start_1:end_1}$ and the entity label vectors $E_{start_1:end_1}$, where $start_1$ and $end_1$ are the start and end locations of $e^{(1)}$. The three sequences are concatenated into $src = [H_{start_1:end_1}, T_{start_1:end_1}, E_{start_1:end_1}]$. Then, we get $dst = [H_{start_2:end_2}, T_{start_2:end_2}, E_{start_2:end_2}]$ in the same manner for $e^{(2)}$. The sub-sentence between $e^{(1)}$ and $e^{(2)}$ is denoted as $mid = [H_{end_1:start_2}, T_{end_1:start_2}, E_{end_1:start_2}]$. Here, we assume $end_1 < start_2$ without loss of generality. If $e^{(1)}$ and $e^{(2)}$ are adjacent, we assign a learnable vector to mid to represent this situation.
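The enumeration of candidate pairs can be sketched in a few lines. Excluding pairs of a trigger with itself is an assumption made here, since the set definition does not mention it; the identifiers follow the T1-T5 naming used for Fig. 4, and the function name is hypothetical.

```python
from itertools import product


def candidate_pairs(event_triggers, entities):
    """All (e1, e2) with e1 a trigger and e2 a trigger or an entity.

    Self-pairs (a trigger paired with itself) are skipped, which is an
    assumption of this sketch.
    """
    targets = list(event_triggers) + list(entities)
    return [(e1, e2) for e1, e2 in product(event_triggers, targets) if e1 != e2]


triggers = ["T2", "T4"]
entities = ["T1", "T3", "T5"]
pairs = candidate_pairs(triggers, entities)
print(len(pairs), pairs)
```

Only triggers can appear in the first position, so entity-entity pairs are never generated.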
The src and dst are then fed to the adaptive max pooling layer to get vectors with a fixed shape, denoted as $src_{max}$ and $dst_{max}$, in the same way as Eq. 2. Meanwhile, the sub-sentence mid is fed to a convolutional layer and an adaptive max pooling layer to extract the feature vector, denoted as $mid_{max}$, in the same way as Eq. 1.
Additionally, Zheng et al. [20] mentioned that the distance between two entities, $start_2 - end_1$, can significantly determine the relation. Therefore, we use a distance embedding vector to provide extra information that is ignored by the max pooling operation, and the vector is denoted as $d_v$. The above vectors are concatenated into $r = [src_{max}, mid_{max}, dst_{max}, d_v]$, and two linear layers are applied:
$$y^{(r)} = W_5 f(W_4 r + b_4) + b_5$$
where $W_4$, $W_5$ and $b_4$, $b_5$ are the weight matrices and bias vectors of the two linear layers, respectively. Finally, the vector is fed into a log softmax layer as follows:
$$p^{(r)} = \mathrm{log\_softmax}(y^{(r)})$$
where $p^{(r)} \in \mathbb{R}^c$ is the probability that the relation belongs to each relation class of the total c classes. We assign the m-th class of relation to the candidate pair $(e^{(1)}, e^{(2)})$, where m is the index of the maximal element of $p^{(r)}$ other than none, if $p^{(r)}_m - p^{(r)}_{none} > threshold_r$; otherwise we assign none to the relation.
For each recognized relation with class index m, we compute a support value by:
$$s^{(r)} = p^{(r)}_m - p^{(r)}_{none}$$
where none is the index of the none type.

Event evaluation
Since the entities, event triggers and relations have been recognized, we then apply the Event Evaluation (EE) module to estimate the probability that a candidate event structure is a valid event.
We use another BiLSTM to represent the event structure. We have obtained the word encoding sequence $H = \{h_1, h_2, ..., h_n\}$ and the entity label embedding sequence $E = \{e_1, e_2, ..., e_n\}$ from BiLSTM, the trigger label embedding sequence $T = \{t_1, t_2, ..., t_n\}$ from TR, and the recognized relation set $RL\_recognized = \{(e^{(1)}_i, relation\_type_i, e^{(2)}_i) \mid i = 1, 2, ...\}$ from RC. Then, we enumerate all the valid combinations of event triggers and relations that meet the structure definition of the given task. All the valid combinations in a sentence form the candidate event set denoted as C.
For each candidate event in C, a sequence of role labels $R = \{r_1, r_2, ..., r_n\}$ is used to represent the role of each word in the candidate event. The i-th item in R is assigned to event_trigger, relation_type_k or none_type according to the role that the i-th word plays. The role label sequence is then transformed into vectors by the role label embedding layer, and the result is denoted as $V^{(e)} = \{v^{(e)}_1, v^{(e)}_2, ..., v^{(e)}_n\}$. Then, we concatenate the corresponding vectors from the sequences H, E, T and $V^{(e)}$ in parallel, which is written as $X = \{x_1, x_2, ..., x_n\}$ with $x_i = [h_i, e_i, t_i, v^{(e)}_i]$. The sequence X is fed to another BiLSTM layer:
$$h^{(e)}_{last} = \mathrm{BiLSTM}(X)$$
The last outputs from the two directions of the BiLSTM, $h^{(e)}_{last}$, are fed to a fully connected layer and then mapped to the log probability $p^{(e)} \in \mathbb{R}^2$ by a log softmax layer. The probability vector $p^{(e)}$ denotes whether the event is a positive sample or a negative sample. Meanwhile, we use another fully connected layer and log softmax layer, branched from $h^{(e)}_{last}$, to learn a vector $p^{(m)} \in \mathbb{R}^3$ that denotes the modification information of the event. The three items in $p^{(m)}$ represent the log-probabilities of Negation, Speculation and None, respectively.
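The role label sequence for one candidate event can be built as in the sketch below. The token list is a hypothetical sentence made up around the trigger and arguments of the Fig. 1 example, spans are given as inclusive word indices, and the function name is an assumption of this illustration.

```python
def role_labels(n_words, trigger_span, arguments):
    """Build the role label of every word for one candidate event.

    trigger_span: (start, end) word indices of the trigger, inclusive
    arguments:    list of (relation_type, (start, end)) for each argument
    """
    roles = ["none_type"] * n_words
    for relation_type, (start, end) in arguments:
        for i in range(start, end + 1):
            roles[i] = relation_type
    start, end = trigger_span
    for i in range(start, end + 1):
        roles[i] = "event_trigger"
    return roles


# Hypothetical tokenized sentence, not quoted from any corpus.
tokens = ["over-expression", "may", "promote", "tumorigenesis"]
roles = role_labels(
    n_words=len(tokens),
    trigger_span=(2, 2),
    arguments=[("Theme", (3, 3)), ("Cause", (0, 0))],
)
print(list(zip(tokens, roles)))
```

Each role label would then be looked up in the role label embedding layer before concatenation with the other per-word vectors.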
For each candidate event, we compute a support value by:
$$s^{(e)} = p^{(e)}_1 - p^{(e)}_2$$
where $p^{(e)}_1$ is the log probability that the candidate event is a valid event and $p^{(e)}_2$ is the log probability that it is not. After we obtain the probability vectors $p^{(t)}$, $p^{(r)}$, $p^{(e)}$ and $p^{(m)}$ from TR, RC and EE, end-to-end training is then applied to optimize the NLL loss of these vectors according to the correct labels of triggers, relations, events and modifications, respectively.
Fig. 4 A simple example to illustrate the principle of CS

Combination strategy
After we have obtained the support value of each candidate trigger, relation and event, the Combination Strategy (CS) is then applied to integrate all of the information and determine the set of final extracted events in prediction phase. CS does not participate in training.
The principle of CS is to minimize the penalty value, which is designed to measure the discordance between the final events and the outputs of previous modules. The penalty consists of two parts as follows. If a trigger or a relation has a positive support value, but it does not appear in any event of the final extracted events, it generates a penalty called "support waste". In contrast, if a candidate event with a negative support value is included as one of the final extracted events, it generates a penalty called "support lacking". These two penalties are conflicting objectives. The more candidate events that are added into the final set, the smaller the "support waste" penalty will be, but the "support lacking" penalty will increase, and vice versa. The goal is therefore to minimize the total penalty, which determines the final extracted events.
Formally, this target is written as follows:
$$c_{best} = \arg\min_{c \subseteq C} \mathrm{penalty}(c) \quad (11)$$
where $c_{best}$ is the set of final extracted events, c enumerates all the subsets of C, and penalty(c) is a sum of three terms computed from the support values, where $s^{(e)}_k$ denotes the support of the k-th candidate event, $s^{(t)}_i$ denotes the support of the i-th event trigger, and $s^{(r)}_j$ denotes the support of the j-th relation. The first term accounts for the support lacking penalty, and the last two terms account for the support waste penalty; the terms are reweighted by the parameters α, β and γ. Figure 4 is a simple example to illustrate the principle of CS. As shown in Fig. 4, three entities (T1, T3 and T5) were given, and two event triggers (T2 and T4) and six relations were detected; a total of six candidate events were then constructed from them. After computing the penalties, the minimum penalty was obtained by choosing two of the candidate events as the final result.
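The optimization target can be made concrete with an exhaustive search over all subsets of a small candidate set. The exact form of the penalty is not reproduced here: the sketch assumes one plausible reading in which α weights the support lacking term and β and γ weight the support waste of triggers and relations, respectively. That assignment, the dictionary layout of a candidate event and all numbers are assumptions of this illustration.

```python
from itertools import combinations


def penalty(chosen, trigger_support, relation_support,
            alpha=0.5, beta=0.25, gamma=0.125):
    """Assumed penalty: weighted support lacking plus weighted support waste."""
    used_triggers = {e["trigger"] for e in chosen}
    used_relations = {r for e in chosen for r in e["relations"]}
    # Support lacking: chosen events whose support value is negative.
    lacking = sum(-e["support"] for e in chosen if e["support"] < 0)
    # Support waste: positively supported triggers/relations in no chosen event.
    waste_t = sum(s for t, s in trigger_support.items()
                  if s > 0 and t not in used_triggers)
    waste_r = sum(s for r, s in relation_support.items()
                  if s > 0 and r not in used_relations)
    return alpha * lacking + beta * waste_t + gamma * waste_r


def best_subset(candidates, trigger_support, relation_support):
    """Enumerate all subsets of the candidates (exponential in their number)."""
    best, best_penalty = (), float("inf")
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            p = penalty(subset, trigger_support, relation_support)
            if p < best_penalty:
                best, best_penalty = subset, p
    return list(best), best_penalty


# Hypothetical support values for one trigger, two relations and two candidates.
trigger_support = {"T2": 1.5}
relation_support = {"R1": 2.0, "R2": -0.5}
candidates = [
    {"id": "E1", "trigger": "T2", "relations": ["R1"], "support": 0.8},
    {"id": "E2", "trigger": "T2", "relations": ["R1", "R2"], "support": -1.0},
]
chosen, value = best_subset(candidates, trigger_support, relation_support)
print([e["id"] for e in chosen], value)
```

In this toy case, choosing nothing wastes the positive support of T2 and R1, while adding E2 pays for its negative support, so E1 alone gives the lowest penalty.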
To optimize Eq. 11, we have to enumerate all of the subsets of C, which requires exponential time complexity. For some sentences, there are many complex events with many candidate arguments, which leads to a very large computation cost ($2^6$ subsets in Fig. 4, for example). Therefore, the CS, which is an approximation algorithm, is proposed to solve the problem within $O(n^2)$ time complexity.
The CS receives the set of candidate events C and the support values of the triggers, relations and candidate events in a sentence, and then it returns the set of final extracted events. Before the CS, the candidate events are sorted by the topological order of their nested structure if one exists, and the events relating to different triggers are handled independently, except for transmitting the support value between them. In the CS, the initial set of chosen events is empty, and then the candidate events are added into the set one by one in a greedy manner. The pseudo code of the CS is shown in Algorithm 1.
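The greedy selection can be sketched as below. This is not Algorithm 1 itself: the stopping rule (stop when no single addition lowers the penalty) is an assumption, the topological ordering and per-trigger handling are omitted, and the penalty is passed in as a function so that any definition can be plugged in. The toy penalty and its data are hypothetical.

```python
def greedy_combination(candidates, penalty_score):
    """Add candidate events one by one while the penalty keeps decreasing.

    Each round evaluates every remaining candidate, so the number of
    penalty evaluations is at most quadratic in the number of candidates.
    """
    chosen = []
    best_penalty = penalty_score(chosen)
    remaining = list(candidates)
    while remaining:
        # Find the single event whose addition yields the lowest penalty.
        best_event, best_tmp = None, best_penalty
        for event in remaining:
            p = penalty_score(chosen + [event])
            if p < best_tmp:
                best_event, best_tmp = event, p
        if best_event is None:   # no addition lowers the penalty
            break
        chosen.append(best_event)
        remaining.remove(best_event)
        best_penalty = best_tmp
    return chosen, best_penalty


# Toy penalty (hypothetical): uncovered items cost their support, and
# each chosen event with negative support costs its absolute value.
item_support = {"t1": 1.0, "r1": 2.0, "r2": 0.5}
events = {"A": (0.5, {"t1", "r1"}), "B": (-1.0, {"t1", "r2"}), "C": (-0.2, {"r2"})}


def toy_penalty(chosen):
    covered = set().union(*(events[e][1] for e in chosen))
    waste = sum(s for item, s in item_support.items() if item not in covered)
    lacking = sum(-events[e][0] for e in chosen if events[e][0] < 0)
    return waste + lacking


print(greedy_combination(list(events), toy_penalty))
```

In the toy run, event A is added first because it covers the most support, then C is added because covering r2 outweighs its small negative support, and B is rejected.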
To prevent the nested events from forming event loops, we set a loop detector after CS. The extracted events are added into the final set one by one, and we discard an event if its inclusion would cause an event loop. We assign event modifications (speculation/negation or none-modification) to the events in the final extracted set according to their modification vectors $p^{(m)}$.
Algorithm 1 (excerpt):
for each event ∈ C − E_chosen do
    penalty = penalty_score(E_chosen ∪ event);
    if penalty < penalty_best_tmp then
        penalty_best_tmp = penalty;
There are two advantages of the CS. First, the CS can address event triggers and relations with negative support value. In the TR and RC, we can set negative thresholds threshold_t and threshold_r to obtain more candidate event triggers and relations, which is especially useful for