Self-training in significance space of support vectors for imbalanced biomedical event data

Background: Pairwise relationships extracted from biomedical literature are insufficient in formulating biomolecular interactions. Extraction of complex relations (namely, biomedical events) has become the main focus of the text-mining community. However, there are two critical issues that are seldom dealt with by existing systems. First, an annotated corpus for training a prediction model is highly imbalanced. Second, supervised models trained on only a single annotated corpus can limit system performance. Fortunately, there is a large pool of unlabeled data containing much of the domain background that one can exploit.
Results: In this study, we develop a new semi-supervised learning method to address the issues outlined above. The proposed algorithm efficiently exploits the unlabeled data to leverage system performance. We furthermore extend our algorithm to a two-phase learning framework. The first phase balances the training data for initial model induction. The second phase incorporates domain knowledge into the event extraction model. The effectiveness of our method is evaluated on the Genia event extraction corpus and a PubMed document pool. Our method can identify a small subset of the majority class, which is sufficient for building a well-generalized prediction model. It outperforms the traditional self-training algorithm in terms of f-measure. Our model, based on the training data and the unlabeled data pool, achieves comparable performance to the state-of-the-art systems that are trained on a larger annotated set consisting of training and evaluation data.


Background
As biomedical literature on servers grows exponentially in the form of semi-structured documents, biomedical text mining has been intensively investigated in order to find information in a more accurate and efficient manner. Previous efforts have focused on recognition of entity mentions, such as genes, proteins, diseases, or drug names [1][2][3][4][5][6], and on extraction of pairwise relationships, such as protein-protein interaction [7] and gene-disease association [8]. However, recognized named entities and extracted pairwise relationships are insufficient for understanding biomolecular interactions [9]. Therefore, extraction of complex relations (namely, biomedical events) has received increasing attention. According to the BioNLP series challenges [10][11][12], a biomedical event is formulated as follows: an event has a trigger, a type and a set of arguments. An element of the argument set has a role and can be a protein mention or another event, depending on the event type. For example, 'RFLAT-1 activates RANTES gene expression' (PMID: 10023774) describes two events: one is a simple gene expression event for RANTES, and the other is a complex positive-regulation event that is caused by RFLAT-1, anchored by 'activates', and has the gene expression event as its argument. The BioNLP GE'09 and '11 challenges target three subtasks addressing event extraction at different levels of specificity: event detection, event enrichment, and negation and speculation detection. As solutions to the BioNLP challenges, many methods have been proposed to predict biomedical events from text. The solutions include rule-based and machine learning (ML) approaches. Rule-based event extraction systems rely on a rule set that is manually collected or automatically induced from training data [13][14][15][16]. The rule-based systems tend to achieve high precision with low recall and to perform better on prediction of simple events.
To increase the recall, a rule-generation process has to process a huge amount of text to collect the rule set with high coverage. Since most computation is for matching pre-generated rules against text, such systems show good performance in terms of computation efficiency.
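The event formulation given above (a trigger, a type and a set of role-labeled arguments, each a protein mention or another event) can be illustrated with a small data structure. This is a minimal sketch rather than the representation used by any of the cited systems: the class and field names are hypothetical, the type and role strings follow common BioNLP shared-task naming, and the trigger word chosen for the gene expression event is an assumption, since the text only names 'activates' as a trigger.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Protein:
    """A protein mention in the text."""
    text: str


@dataclass
class Argument:
    """One element of an event's argument set: a role plus its filler."""
    role: str                        # e.g. "Theme" or "Cause"
    filler: Union[Protein, "Event"]  # a protein mention or another event


@dataclass
class Event:
    """An event has a trigger, a type and a set of arguments."""
    trigger: str
    event_type: str
    arguments: List[Argument] = field(default_factory=list)


# 'RFLAT-1 activates RANTES gene expression' (PMID: 10023774)
rantes = Protein("RANTES")
rflat1 = Protein("RFLAT-1")

# Simple event: gene expression of RANTES (trigger word assumed for illustration).
expression = Event("expression", "Gene_expression", [Argument("Theme", rantes)])

# Complex event: positive regulation anchored by 'activates', caused by RFLAT-1,
# with the gene expression event itself as an argument (a nested event).
regulation = Event(
    "activates",
    "Positive_regulation",
    [Argument("Theme", expression), Argument("Cause", rflat1)],
)
```

The nesting of `expression` inside `regulation` is what distinguishes complex events from pairwise relations.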
An ML-based event extraction system sees the task as a classification problem. The proposed approaches can be divided into three groups, depending on how recognition of the event trigger and argument is designed. The systems belonging to the first group are based on a text-mining pipeline approach [17][18][19][20]. Björne et al. [17] first adopted the pipeline approach in their Turku event extraction system. The first version of the Turku system consists of three stages: trigger detection, edge detection, and event duplication. The first two stages solve ML classification problems, and the last relies on a rule set. Miwa et al. [18] later improved the pipeline approach in their EventMine system by introducing an additional classification model for the rule-based event duplication task. In general, each stage in the extraction pipeline solves multi-class and multi-label classification problems based on an imbalanced dataset with a high-dimensional feature space. A linear support vector machine (SVM) with one-versus-rest label decoding has been a main tool for this. For this group of approaches, errors made in a former step propagate into subsequent steps, introducing an error cascade. To overcome this issue, the second ML-based group uses global models that solve the whole task at once [21,22]. Riedel et al. [21] encoded graph structures of events using a set of binary variables representing the type of token nodes and the relations between them. The state of these variables is predicted by maximizing the global likelihood using integer linear programming. This joint model achieves good performance, but could be overly complicated for finding optimal states because it has to include every combination of tokens in the search space. To reduce the search space, they use a dictionary of triggers from the training data. This might in turn decrease the overall recall. McClosky et al. [22] aimed at solving the task as dependency parsing by exploiting global properties of event structures. The third ML-based group combines the global and pipeline approaches by using a pairwise model [23]. The pairwise model jointly predicts the trigger and the argument of an event as a pair of text parts, in contrast to the pipeline approach. However, such a model still has subsequent steps for prediction of events with more than one argument, and is unable to extract nested events.
There are two critical issues that are seldom dealt with by the aforementioned systems. First, training data is highly imbalanced. From traditional sampling [24] (under-sampling or over-sampling) to active learning [25], solutions have been developed to induce prediction models on such imbalanced datasets [26]. In Björne et al. [27] and the EventMine system, a simple class weighting method with an SVM [28] is used. Second, the supervised models trained on only a single annotated corpus can limit system coverage and scalability. Besides merging multiple annotated corpora into one [29], semi-supervised learning (SSL) is applied to overcome this issue [30]. SSL has received significant attention for two reasons. First, annotating data for training is time- and labor-intensive. For instance, annotating the Genia event corpus consisting of 9372 sentences required 1.5 years with five part-time annotators and two coordinators [31]. Second, because SSL exploits unlabeled data, the accuracy of classifiers is generally improved.
In this study, we combine the approaches of active learning and self-training and develop a new semi-supervised learning method to address the issues outlined above. Our algorithm is built upon the foundation of significance space construction [32]. The training data are augmented by a new example set from unlabeled data, so model learning with them captures patterns from domain background. The new example set is formed based on its significance and confidence score for self-labelling.
Furthermore, we extend the algorithm to a two-phase learning framework. The first phase balances the training data for initial model induction, whereas the second phase incorporates domain knowledge into the model by querying the example set from the unlabeled data. In both phases, the model is built in an online fashion. We evaluated the proposed method on the GE'11 corpus. First, we compared the method against the approaches used to solve the data imbalance problem. Our method can identify a small subset of the majority class that is sufficient for building a well-generalized prediction model. Second, we contrasted it against the traditional self-training algorithm as an evaluation of the semi-supervised learning perspective. Finally, we investigated the event extraction system performance, relying on our proposed method to report the different values of the evaluation measures along with the GE'11 shared task entries. Our model, which learned only on the training data, achieves comparable performance to the state-of-the-art systems that are trained on both GE evaluation and training data.

Methods
We combine the approaches of active learning and self-training to develop a new semi-supervised learning method, which we call self-training in significance space (STSS). An STSS-based two-phase learning framework is also proposed in this study, in order to leverage system performance and to solve the data imbalance issue by exploiting unlabeled data.

Text preprocessing and feature extraction
Text preprocessing, in which text data is cleaned and processed with natural language processing (NLP) tools, is a preparatory task for the feature extraction step. We adopt the Turku system [27] for preprocessing and feature extraction. Turku allows extraction of a rich set of features, and is well tuned for processing PubMed abstracts for the purpose of event extraction.
In preprocessing, the text data is cleaned by removing non-standard characters, and is then processed with NLP tools to have its sentences split, tagged with parts of speech (POS), and parsed. We use the Genia Sentence Splitter (GeniaSS) [33] and the Charniak-Johnson parser [34] with McClosky's biomedical parsing model [35]. GeniaSS relies on maximum entropy models and is optimized for biomedical text data. McClosky's model was built with self-training, incorporating the domain knowledge from PubMed abstracts.
In addition to the preprocessing step, we recognize biomedical named entities in the unlabeled text. We use the Bayesian finite state model (BFSM) with the Bayesian classifier from our previous study [36] and the BANNER tool [37] for biomedical named entity recognition, in order to improve recognition accuracy. Our named entity recognition system solves the name boundary issue and achieves a high precision, whereas the BANNER tool utilizes a sequence labeling model, conditional random fields, and shows reliable performance on the task.
Once a graph representation of the full dependency within a sentence is obtained, we extract features for the event extraction model with the Turku system. The following four different feature sets are extracted.
Token features: orthographic features, POS tags, base words with the Porter stemmer, and character n-grams (n = {1, 2, 3}).
Sentence features: the number of entities and bag-of-words.
Sentence dependency features: n-grams of the words on the shortest path between two entities, and features based on triggers present.
External resource features: WordNet hypernyms [38], and a similarity measure against lexicons of biomedical terms.
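To make the token feature set more concrete, the sketch below computes character n-grams for n = 1, 2, 3 together with a few orthographic indicators. It is only an illustration and not the Turku system's feature extractor: the function names and the particular orthographic features (uppercase, digit, hyphen) are assumptions chosen for brevity.

```python
from typing import Dict, Iterable, List


def char_ngrams(token: str, sizes: Iterable[int] = (1, 2, 3)) -> List[str]:
    """Return all character n-grams of the token for each n in `sizes`."""
    return [token[i:i + n] for n in sizes for i in range(len(token) - n + 1)]


def token_features(token: str) -> Dict[str, int]:
    """Sparse binary features for one token (feature names are hypothetical)."""
    feats = {
        "has_upper": int(any(c.isupper() for c in token)),
        "has_digit": int(any(c.isdigit() for c in token)),
        "has_hyphen": int("-" in token),
    }
    # One indicator per distinct lower-cased character n-gram.
    for gram in char_ngrams(token):
        feats["ngram=" + gram.lower()] = 1
    return feats


features = token_features("IL-2")
```

Because every distinct n-gram becomes its own attribute, feature sets of this kind grow quickly with vocabulary size, which is one reason the resulting datasets are high-dimensional.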

Self-training in significance space of support vectors
The proposed semi-supervised learning method relies on the concept of significance space, which can be constructed from the original training data by different approaches. Significance space with an SVM classifier depends on the feature space of the support vectors (SVs). The flow chart of the proposed method is shown in Figure 1.
First, the original training set is used to build the base classifier, the SVM 1 model. STSS then forms the significance training set, S, by labeling the training examples that are SVs of SVM 1 as significant and the remaining examples of the training set as non-significant. In the next step of the current round, the SVM 2 model is trained on S in order to query the significant subset, U, from unlabeled or labeled data. Different usages of STSS in the learning framework are discussed in the next section. The SVM 2 model can be applied to either labeled data, to select the most informative labeled examples, or to unlabeled data, to let the SVM 1 model give class labels to the informative subset. If the SVM 1 model classifies the significant examples of the unlabeled data, then the confidence-based filtering module is employed to select the confidently labeled examples, L c , with a threshold criterion. We investigate various instance selection strategies, most of which are based on the probability outcome from the classification model:
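One STSS round as described above can be sketched as follows. This is a simplified illustration under several assumptions, not the authors' implementation: the problem is reduced to binary labels in {-1, +1}, a toy sub-gradient trainer stands in for a real SVM library, support vectors are approximated as training examples on or inside the margin, and the absolute decision value replaces the probability-based confidence. All function names and the threshold value are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Model = Tuple[List[float], float]  # (weights, bias) of a linear classifier


def score(model: Model, x: Vector) -> float:
    """Signed decision value w.x + b."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X: List[Vector], y: List[int],
                     epochs: int = 200, lr: float = 0.05, lam: float = 0.01) -> Model:
    """Toy linear SVM: sub-gradient descent on the L2-regularized hinge loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violated = yi * score((w, b), xi) < 1.0
            w = [wj * (1.0 - lr * lam) for wj in w]      # regularization shrink
            if violated:                                  # hinge-loss step
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def stss_round(X: List[Vector], y: List[int], pool: List[Vector],
               threshold: float = 0.5):
    """Return (SVM 1, significance labels S, confidently self-labeled set L_c)."""
    svm1 = train_linear_svm(X, y)
    # S: +1 for (approximate) support vectors of SVM 1, -1 for the rest.
    sig = [1 if yi * score(svm1, xi) <= 1.0 else -1 for xi, yi in zip(X, y)]
    svm2 = train_linear_svm(X, sig)
    # U: pool examples that SVM 2 predicts to be significant.
    U = [x for x in pool if score(svm2, x) > 0.0]
    # SVM 1 self-labels U; keep only labels whose confidence passes the threshold.
    L_c = []
    for x in U:
        s = score(svm1, x)
        if abs(s) >= threshold:
            L_c.append((x, 1 if s > 0 else -1))
    return svm1, sig, L_c
```

When the pool already carries true labels, the self-labeling step can be skipped and the examples in U taken with their gold labels instead.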

STSS-based two-phase learning framework
We propose a two-phase learning framework based on the STSS algorithm to solve the imbalanced data problem, while exploiting unlabeled data efficiently. The first phase solves the data imbalance problem by selecting only a small and informative subset of the majority class. First the original training data, D, is partitioned into two parts: D i to be initial training data, and D u to be used as the unlabeled data (D i << D u ) for the STSS algorithm. Then, the STSS runs to build the SVM 1 model while drawing a balanced subset of the original training data: D s . Since each instance in the D u set already has a true class label, the self-labeling schema with SVM 1 is not utilized for the first run of the STSS (the STSS run of Phase I).
The second phase is designed to exploit unlabeled data in order to improve the performance of the SVM 1 model output by the previous phase. The inputs to this phase are the SVM 1 model, the balanced subset of training data D s , and previously prepared unlabeled data U. We simply run the STSS and update the SVM 1 model with the self-labeled informative examples chosen from the unlabeled data. For each running iteration of the STSS, the SVM 1 model is updated and stored for further evaluation.
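The control flow of the two phases can be outlined with a small driver that takes the learning steps as pluggable functions. This is a sketch under assumptions rather than the published implementation: the callables `train`, `select_significant` and `self_label`, the fixed number of Phase I rounds, and the splitting of the unlabeled data into batches are all hypothetical choices made for illustration.

```python
from typing import Any, Callable, List, Sequence, Tuple

Vector = Sequence[float]
Example = Tuple[Vector, int]  # (feature vector, class label)


def two_phase_learning(
    D: List[Example],
    unlabeled_batches: List[List[Vector]],
    train: Callable[[List[Example]], Any],
    select_significant: Callable[[Any, List[Vector]], List[int]],
    self_label: Callable[[Any, List[Vector]], List[Example]],
    initial_size: int,
    phase1_rounds: int = 3,
) -> Tuple[List[Example], List[Any]]:
    # Phase I: split D into a small initial set D_i and a large remainder D_u,
    # then grow D_s with significant examples from D_u using their TRUE labels.
    D_s = list(D[:initial_size])
    D_u = list(D[initial_size:])
    model = train(D_s)
    for _ in range(phase1_rounds):
        picked = set(select_significant(model, [x for x, _ in D_u]))
        if not picked:
            break
        D_s += [ex for i, ex in enumerate(D_u) if i in picked]
        D_u = [ex for i, ex in enumerate(D_u) if i not in picked]
        model = train(D_s)

    # Phase II: add self-labeled informative examples from unlabeled data,
    # updating the model and storing it after every iteration.
    models = []
    for batch in unlabeled_batches:
        D_s += self_label(model, batch)
        model = train(D_s)
        models.append(model)
    return D_s, models
```

Storing one model per Phase II iteration mirrors the description above, so that each intermediate model can be evaluated later.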

Base classifier
The dataset extracted from a biomedical event corpus readily becomes high-dimensional, having hundreds of thousands of attributes, and needs a solution to the multi-label and multi-class classification problem. An SVM classifier with a linear kernel is a benchmark in classifying such high-dimensional data.
The linear SVM classifier is well suited to a binary classification problem, and thus requires additional wrapper methods to deal with the multi-label and multi-class classification task. We use a one-versus-rest approach with linear SVM. The one-versus-rest approach builds one classifier per class in the dataset, treating each classifier as an individual component that discriminates between the instances of one class and the instances of the other classes, and combines the classifiers in a simple voting schema to make the class decision on a test instance. In the experiment, we train the classifiers in parallel, running one learning process per core to speed up model induction.
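The one-versus-rest decision step can be sketched as below. The exact voting schema is not specified in the text, so this example assumes two common decodings: for a multi-class decision, the class whose binary classifier gives the highest score wins; for a multi-label decision, every class with a positive score is returned. The scorers are hypothetical hand-written linear functions standing in for trained SVMs.

```python
from typing import Callable, Dict, List, Sequence

Scorer = Callable[[Sequence[float]], float]  # decision value of one binary classifier


def predict_class(scorers: Dict[str, Scorer], x: Sequence[float]) -> str:
    """Multi-class decision: the class whose classifier scores highest."""
    return max(scorers, key=lambda label: scorers[label](x))


def predict_labels(scorers: Dict[str, Scorer], x: Sequence[float]) -> List[str]:
    """Multi-label decision: every class whose classifier votes positive."""
    return [label for label, f in scorers.items() if f(x) > 0.0]


# One scorer per class; each separates its class from all others.
scorers = {
    "Theme": lambda x: x[0] - 0.5,
    "Cause": lambda x: x[1] - 0.5,
    "neg": lambda x: 0.5 - x[0] - x[1],
}

x = [0.9, 0.6]
best = predict_class(scorers, x)      # single best class
active = predict_labels(scorers, x)   # all classes voted positive
```

Since each scorer is independent of the others, the per-class classifiers can be trained in separate processes.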

Results
We evaluated the proposed method on the GE'11 dataset by comparing its performance with state-of-the-art systems. First, we compared the method against the approaches used to solve the data imbalance problem. Second, in order to observe the effectiveness of the method from the semi-supervised learning perspective, we contrasted it against the traditional self-training algorithm. These comparisons were accomplished by using a dataset generated for the edge detection subtask. Finally, we investigated the event extraction system performance, relying on our proposed method to report the different values of the evaluation measures along with the GE'11 shared task entries. Our source code for the algorithm, the two-phase learning framework, and the other benchmarks is available at https://bitbucket.org/tsendeemts/stss.

Text corpus and dataset
Because the STSS method is a semi-supervised learning method and is able to exploit raw text data, we prepared a PubMed document collection in addition to the GE'11 dataset. Therefore, we used the GE'11 dataset as labeled data and the PubMed collection consisting of around 1,400,000 abstracts as unlabeled data for this experiment. The GE'11 dataset is a super-set of the GE'09 dataset [11] and is split into training, development and testing sets, consisting of 800 abstracts (+5 full papers), 150 abstracts (+5 full papers), and 260 abstracts (+4 full papers), respectively. We applied the training set for model induction and the development set for evaluation of models for the subtasks throughout the experiment in this study. Finally, the test set was used to report the whole-system performance based on the STSS method. We adopted the pipeline approach, so the three different datasets (for trigger detection, edge detection or event construction) were extracted from the original set. Table 1 provides the class label distribution over the instances in the training set for edge detection. The first column and the second column contain the class label representing trigger-entity connection and the number of instances belonging to every class, respectively. The third column reports the class imbalance ratio to demonstrate one of our main concerns in this study: that the event extraction datasets are highly imbalanced. In fact, event detection and trigger detection are considered to be challenging classification problems in machine learning, caused by the data imbalance nature and the high dimen-

STSS for the imbalanced data problem
We compared the performance of the STSS method against the baseline approaches, under-sampling and class weighting, on the edge detection dataset. The class weighting method is tightly integrated with SVM classifiers and has been used previously in many event extraction systems [17,18,23].
Since there are only three instances of the Site-Theme class in the training dataset, we removed this class and reduced the number of distinct class labels to 7. The different settings for each baseline were evaluated on the development set to tune the corresponding hyper-parameters, and the setting giving the highest accuracy was selected. We under-sampled the negative examples so that the class distribution had ratios of 1/3, 1/4, 1/5 and 1/6, and the best sampling reported in this experiment was the dataset with a ratio of 1/4. For the class weighting run, every class received a weight that is inversely proportional to the class frequency; therefore, the minor classes received a higher weight than the major classes. The second strategy among the instance selection strategies defined in Section 2.2 worked slightly better than the others in our STSS implementation.
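The two baselines can be sketched in a few lines. This is an illustration under assumptions, not the exact experimental code: the weight formula is one common normalization of inverse class frequency, the sampling ratio is read as positives over negatives, and the function names and toy label counts are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple


def inverse_frequency_weights(labels: List[str]) -> Dict[str, float]:
    """Weight each class inversely proportional to its frequency."""
    counts = Counter(labels)
    total, k = len(labels), len(counts)
    return {c: total / (k * n) for c, n in counts.items()}


def undersample_negatives(examples: List[Tuple[int, str]], negative_label: str,
                          ratio: float, seed: int = 0) -> List[Tuple[int, str]]:
    """Keep all positives and a random subset of negatives so that
    positives / negatives is approximately `ratio` (e.g. 1/4)."""
    pos = [e for e in examples if e[1] != negative_label]
    neg = [e for e in examples if e[1] == negative_label]
    keep = min(len(neg), round(len(pos) / ratio))
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    return pos + rng.sample(neg, keep)


# Toy label distribution (not taken from the corpus statistics).
labels = ["neg"] * 90 + ["Theme"] * 8 + ["Cause"] * 2
weights = inverse_frequency_weights(labels)

examples = list(enumerate(labels))
sampled = undersample_negatives(examples, "neg", ratio=1 / 4)
```

Under-sampling discards negatives at random, whereas class weighting keeps every example and changes only its cost in training.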
The overall performance of the different approaches is shown in Table 2. As we expected, STSS performed best on this imbalanced dataset, followed by the class weighting method. Surprisingly, none of the methods were able to classify instances of the AtLoc class, showing 0.0% for the f-measure. This might be due to non-representative features for the class. We used the SVM classifier with a linear kernel that is able to train on the dataset with a large number of attributes, even though arbitrary classifiers could be applied to the under-sampled and class-weighted datasets.
Finally, we investigated the data efficiency of STSS by comparing original and STSS-training datasets constructed especially for one-versus-rest SVM model induction.
The STSS training dataset is an informative subset of the original, effectively sampled through the first phase of the STSS-based two-phase learning. The comparison is shown in Table 3, which reports the statistics for the original and the STSS training datasets. Sampling the informative subset dramatically increases the imbalance ratio (an imbalance ratio close to 1.0 is preferred). We can see that only a small subset of the negative examples in the original dataset can represent the others, and our STSS method can identify them effectively.
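The effect on the imbalance ratio can be illustrated with a short calculation. The definition used here, positive count divided by negative count for one binary one-versus-rest problem, is an assumption that is consistent with a ratio close to 1.0 being preferred; the counts are toy numbers and are not taken from Table 3.

```python
from typing import List


def imbalance_ratio(labels: List[str], positive: str) -> float:
    """Positives divided by negatives for one one-versus-rest problem."""
    pos = sum(1 for lab in labels if lab == positive)
    neg = len(labels) - pos
    return pos / neg if neg else float("inf")


# Toy data: the full set versus an informative subset that keeps all
# positives but only a few negatives.
original = ["Theme"] * 10 + ["neg"] * 990
subset = ["Theme"] * 10 + ["neg"] * 20

ratio_original = imbalance_ratio(original, "Theme")
ratio_subset = imbalance_ratio(subset, "Theme")
```

Dropping uninformative negatives moves the ratio toward 1.0 without removing any positive example.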

Comparison with traditional self-training
In the second phase of the STSS-based learning framework, the STSS algorithm was employed on both labeled and unlabeled data, acting as a typical semi-supervised learning method. Since STSS could be viewed as a variation of self-training, we compared our method with traditional self-training.
We ran each method five times, using the edge detection training dataset for training and the development dataset for evaluation, and report the best performance of each method for comparison.
The overall performance of the two methods is shown in Table 4. We see that our method achieves a higher recall value without losing precision, improving the f-measure in general. The STSS algorithm outperforms self-training by 14.18% for f-measure in the Cause-Theme class. Since the dataset is highly imbalanced, general measurements like precision, recall, and f-measure reported a value of 100% for the negative class.
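For reference, the per-class precision, recall and f-measure used in these comparisons can be computed as in the sketch below. It uses the standard definitions with the balanced F1 score; the function name and the toy gold and predicted labels are illustrative and are not results from Table 4.

```python
from typing import List, Tuple


def precision_recall_f1(gold: List[str], pred: List[str],
                        label: str) -> Tuple[float, float, float]:
    """Precision, recall and F1 of one class, treating all others as negative."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["Theme", "Theme", "Cause", "neg", "neg"]
pred = ["Theme", "neg", "Cause", "neg", "Theme"]

theme_scores = precision_recall_f1(gold, pred, "Theme")
atloc_scores = precision_recall_f1(gold, pred, "AtLoc")  # class never predicted
```

A class that is never predicted correctly yields an f-measure of 0.0 under these definitions.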

Performance Evaluation
In this section, we evaluated the performance of the event extraction system based on the proposed method and compared it with GE'11 entries. We trained the classification models of the system with the GE'11 training corpus and then tested the system performance with the GE'11 test corpus (Task 1). Table 5 shows the results of the extraction method evaluated on the test dataset using the Approximate Span/Approximate Recursive matching criteria. The evaluation results of the abstract and full text documents are separately reported to show the type of document on which our method performs better. Our method tends to perform better on full-text documents. This can also be seen from Table 6 where we compare the evaluation results with other GE'11 entries. Our results in terms of f-measure are slightly higher than Turku's result on the full-text documents.
The overall evaluation results are close to those of the Turku system because we injected the classification models trained using STSS into Turku's pipeline, sharing the same preprocessing and feature extraction stages. However, our STSS model uses the training dataset and the unlabeled data, whereas the Turku model uses a larger annotated set consisting of the training and evaluation datasets.

Event extraction-based applications
Event extraction from biomedical text benefits a broad range of applications in systems biology, namely literature searching, interaction network generation and pathway construction. Biomedical events extracted from literature are indexed to support a deeper semantic search in comparison to traditional keyword-based search engines. Thus, such a search system can provide results with a high precision to a user. The BioContext [39] and MEDIE [40] systems are examples of event-based search engines. MEDIE has an intelligent interface for retrieving biomedical events referenced with their literature from the entire MEDLINE. In MEDIE, the user can even use a partial query to obtain a large number of results and browse through them. Like MEDIE, BioContext is built by integrating several different text mining tools, including named entity recognition, entity normalization, full dependency parsing and event extraction. The BioContext system processed 10.9 million MEDLINE abstracts and 234,000 full-text articles and made 11.4 million distinct events searchable.
Another application based on event extraction is the generation of interaction networks. Sufficiently accurate event graphs can be used for inferring complex regulatory relationship networks and other biologically relevant tasks [27]. Björne et al. [27] applied their Turku system to 1% of MEDLINE and constructed connected components of the event graph. While there is a wide range of application areas where event extraction can be useful, many issues remain to be solved in practice. System performance is still low in terms of F-measure. The best performing system for the GE'11 challenge task showed a 53.14% F-measure. Another issue to be addressed is computational requirements. ML-based systems require a large computation time, mostly consumed by the full dependency parser in the preprocessing step. Björne et al. [27] reported that their Turku system took 98 processor hours to extract events from only 1% of MEDLINE (411 processor days for the entire MEDLINE). Even using a cluster of 100 processors, BioContext required 2-3 months to process the full document collection [39]. We used a cluster of 12 processor cores, 4 of which were used for ML example construction from unlabeled data and the others for training of SVM models. It took 3 months to complete the experiments.

Conclusions
In this study, we proposed a new semi-supervised method and its two-phase framework to build a biomedical event extraction model with imbalanced data. Our method iteratively constructs significance space from training data and augments it with a self-labeled example set that falls into that space. Based on this online process, our learning framework in its first phase forms a balanced subset of the original data for initial model induction, and then in the second phase, incorporates domain knowledge from unlabeled data into the model. Consequently, the framework not only solves the data imbalance problem, but also exploits the unlabeled data to leverage system performance. Experimental results demonstrated the efficiency of our method from multiple perspectives. Our method can identify a small and sufficient subset of the majority class. It outperforms the traditional self-training algorithm in terms of f-measure. Our method builds a well-generalized prediction model with a small training set and additional unlabeled data. The proposed method can be applied to other real-world applications where training data might be small and imbalanced, and unlabeled data is less expensive to collect.
For future work, we will explore representation learning to integrate external resources into biomedical event extraction. We will also investigate methods to avoid complex preprocessing (e.g. full dependency parsing) and to address error cascading. Deep learning methods might help in this.