We extended our previously proposed event extraction system [4] in several ways for the 2013 BioNLP shared task. First, we experimented with incorporating a distributional similarity model into the graph matching scheme to allow for more variation during matching, and second, we explored the use of dependency paths of all possible lengths (rather than only shortest paths) in the pattern induction phase.
We then explored additional changes to the approach in work subsequent to the 2013 shared task: (1) incorporating a node skipping penalty into the subgraph distance function of our approximate subgraph matching algorithm; (2) learning a customized threshold for each pattern; and (3) applying the well-known empirical risk minimization (ERM) principle to optimize the event pattern set. Below, we elaborate on these system experiments in detail.
Experiments in BioNLP-ST 2013
Integrating distributional similarity model
As described above, the ASM algorithm employs a distance measure based on three dimensions of variance that can exist between two graphs. This allows for some differences between two matched graphs. However, the node mapping performed between the graphs is based on strict lexical matching. In our previous work, we considered various criteria for node matching, including relaxing strict matching to consider token lemmas (L) or POS tags (P), or combinations such as "P*+L" introduced above. However, this still requires fairly tight alignment between a pattern graph and a sentence graph. We experimented with dropping any lemma matching requirement and using only POS information, but observed a sharp drop in precision. Despite a nearly 14% increase in recall, the overall impact on F-scores was strongly negative [16]. This suggests that word-level information is an important component of matching in the framework of our system.
To allow for additional flexibility in word choice, we decided to explore a refinement of the node mapping strategies that takes lexical variation into consideration. This can be considered another dimension of variance to be supported in the algorithm; it would, for instance, allow a pattern token "crucial" to match a sentence token such as "critical", which could result in extraction of a relevant event. We previously attempted to allow for such lexical variation by letting words match their synonyms (as defined by WordNet [17]) [18]. However, since WordNet was developed for general English, it relates biomedical terms, e.g., "expression", to general words such as "aspect" and "face", thus leading to incorrect events.
We therefore decided to experiment with a different approach to accommodating lexical variation during node matching, specifically by integrating an empirically-derived similarity model. We implemented a distributional similarity model (DSM); this model is based on the distributional hypothesis [19] that words that occur in the same contexts tend to share similar meanings. We expected that incorporating such a model would increase recall without impacting precision too much.
There have been many approaches to computing the distributional similarity of words in a corpus [20, 21]. The output is typically a ranked list of similar words for each word. We reimplemented the model proposed by [21], in which each word is represented by a feature vector and each feature corresponds to a context in which the word appears. The value of the feature is the pointwise mutual information [22] between the feature and the word.
Let c be a context and F_c(w) be the frequency count of a word w occurring in context c. The pointwise mutual information mi_{w,c} between c and w is defined as:

mi_{w,c} = \log \frac{\frac{F_c(w)}{N}}{\frac{\sum_i F_i(w)}{N} \times \frac{\sum_j F_c(j)}{N}}

where N = \sum_i \sum_j F_i(j) is the total frequency count of all words and their contexts.
Since mutual information tends to be biased towards infrequent words/features, we multiplied the above mutual information value by a discounting factor as suggested in [21]. We then computed the similarity between two words via the cosine coefficient [23] of their mutual information vectors.
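A minimal Python sketch of this similarity computation is given below, assuming the co-occurrence counts have already been collected from a corpus; the function and variable names (pmi_vector, cosine, counts, word_totals, ...) are illustrative rather than taken from our implementation.

import math

def pmi_vector(word, counts, word_totals, context_totals, grand_total):
    # counts[word][context] = F_c(w); word_totals[word] = sum_c F_c(w);
    # context_totals[context] = sum_w F_c(w); grand_total = N.
    vec = {}
    for c, f_wc in counts[word].items():
        p_wc = f_wc / grand_total
        p_w = word_totals[word] / grand_total
        p_c = context_totals[c] / grand_total
        mi = math.log(p_wc / (p_w * p_c))
        # Discounting factor to reduce the bias toward infrequent
        # words/features, following the form suggested in [21].
        min_total = min(word_totals[word], context_totals[c])
        discount = (f_wc / (f_wc + 1.0)) * (min_total / (min_total + 1.0))
        vec[c] = mi * discount
    return vec

def cosine(v1, v2):
    # Cosine coefficient [23] between two sparse mutual information vectors.
    num = sum(v1[c] * v2[c] for c in set(v1) & set(v2))
    den = math.sqrt(sum(x * x for x in v1.values())) * \
          math.sqrt(sum(x * x for x in v2.values()))
    return num / den if den else 0.0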
We tried two different strategies to integrate distributional similarity into our event extraction system. In the first strategy, DSM is applied at the node matching step, allowing a match between two unequal lexical items if the sentence token appears in the list of the top M most similar words to the pattern token. The second approach is generative and applies to event patterns. A copy of an event pattern is produced by substituting a pattern token with a similar term; this copying is performed for each of the top M most similar words. The first method results in a more general flexibility during event extraction, while the second method gives the opportunity to measure the impact of each possible token substitution in a pattern separately, and to filter out spurious synonyms during the pattern optimization step.
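The two strategies could be realized roughly as sketched below; here similar_words is assumed to map each word to its ranked list of DSM neighbours, and the pattern accessors lexical_nodes() and with_substitution() are hypothetical stand-ins for our pattern data structure.

def nodes_match(pattern_token, sentence_token, similar_words, M=5):
    # Strategy 1: relax node matching so that a sentence token also matches
    # if it is among the top M words most similar to the pattern token.
    if pattern_token == sentence_token:
        return True
    return sentence_token in similar_words.get(pattern_token, [])[:M]

def expand_pattern(pattern, similar_words, M=5):
    # Strategy 2 (generative): emit one pattern copy per top-M substitution,
    # so spurious substitutes can be filtered out during pattern optimization.
    copies = []
    for node in pattern.lexical_nodes():            # hypothetical accessor
        for substitute in similar_words.get(node.word, [])[:M]:
            copies.append(pattern.with_substitution(node, substitute))
    return copies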
Adopting all-paths for event patterns
The ASM algorithm was designed to work with only the shortest path between event components [4]. However, a body of work has explored the value of considering all paths in a dependency graph for tasks such as extraction of protein-protein interactions (PPI) [6], event extraction [12], and drug-drug interactions [24]. The latter system, using an all-paths graph kernel, won the DDIExtraction 2011 challenge [25]. The kernel includes two representations for each sentence with a pair of interacting entities: the full dependency parse and the linear token sequence. At the expense of computational complexity, this representation enables the kernel to explore the full dependency graph, and thereby the broader sentential context of an interaction.
The shortest dependency path may not provide sufficient syntactic context to enable precise relation extraction. Therefore, borrowing from the all-path graph representation, we experimented with extending the representation used by the ASM algorithm in the pattern induction step to consider acyclic paths of all possible lengths among event components.
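A sketch of this change is shown below, assuming the dependency parse is held in a networkx graph and that trigger and argument are the node identifiers of two event components (both are assumptions about the data layout, not our actual structures).

import networkx as nx

def connecting_paths(dep_graph, trigger, argument, all_paths=True):
    # Work on the undirected view of the dependency graph so that paths may
    # follow dependency edges in either direction.
    undirected = dep_graph.to_undirected()
    if not all_paths:
        # Original behaviour: only the shortest dependency path [4].
        return [nx.shortest_path(undirected, trigger, argument)]
    # All-paths variant: every acyclic (simple) path between the components.
    return list(nx.all_simple_paths(undirected, trigger, argument))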
Experiments after BioNLP-ST 2013
Incorporating node skipping penalty into ASM
As shown in Definition 1, the subgraph distance design in our system [4] considers variations in edge labels and edge directionalities but insists that a candidate match should possess an injective mapping between nodes of a pattern graph and a sentence graph.
Preserving the complete lexical context of an annotated event in the induced pattern has the advantage of yielding precise predictions. However, it often retains terms that come from a particular textual expression of an event but are in fact not essential to its underlying meaning. For instance, the dependency context "induction of binding activity" of a pattern encodes the context of a Positive_regulation event cascaded with a lower order Binding event. Since the term "binding" indicates a binding activity by itself, the additional "activity" is redundant. Similarly, the term "gene" in the dependency context of a Regulation event pattern "regulated BIO Entity gene" can be ignored when the "BIO Entity" itself has been pre-annotated as a gene. We therefore hypothesize that providing an option in graph matching to skip the non-essential context words encoded in patterns can improve their generalizability.
We revised the subgraph distance function proposed in [4] by adding a nodeDist measure that penalizes the number of skipped non-essential nodes, normalized by the total number of pattern graph nodes, for each candidate match between pattern and sentence graphs. In our experiments, the essential context nodes of a pattern are taken to be the nodes corresponding to event triggers and event arguments such as theme or cause. The sub-event trigger is also considered essential for patterns that encode cascaded events.
Consequently, the original injective mapping f : V_r → V_s of Definition 1 is relaxed to an injective mapping f' : V'_r → V_s defined on a node subset V'_r with V_e ⊆ V'_r ⊆ V_r, where V_e is the set of essential context nodes in the pattern graph. A candidate match is considered only if such an f' exists between the two graphs. In case the original node injective mapping constraint is satisfied, i.e., no pattern node is skipped, nodeDist becomes 0 and the new distance function is equivalent to the original one. Similar to the weights w_s, w_l and w_d, the non-negative weight w_n can be tuned to adjust the emphasis on nodeDist in the distance function. The new function is defined as follows.
subgraphDist(G_r, G_s) = w_s \times structDist_{f'}(G_r, G_s) + w_l \times labelDist_{f'}(G_r, G_s) + w_d \times directionalityDist_{f'}(G_r, G_s) + w_n \times nodeDist_{f'}(G_r, G_s),

where nodeDist_{f'}(G_r, G_s) is the number of pattern graph nodes skipped by f', normalized by the total number of nodes in G_r.
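A minimal sketch of the revised distance is given below; the three original ASM distance components [4] are assumed to be computed elsewhere and passed in as numbers, and all parameter names are illustrative.

def subgraph_distance(pattern_nodes, essential_nodes, mapping,
                      struct_dist, label_dist, dir_dist,
                      w_s, w_l, w_d, w_n):
    # mapping is the (possibly partial) injective node mapping f'; it must
    # cover every essential node (event triggers, theme/cause arguments and
    # sub-event triggers), while non-essential pattern nodes may be skipped.
    if not set(essential_nodes) <= set(mapping):
        return None  # no admissible f' for this candidate match
    # nodeDist: skipped pattern nodes normalized by all pattern graph nodes.
    node_dist = (len(pattern_nodes) - len(mapping)) / len(pattern_nodes)
    # struct_dist, label_dist and dir_dist stand in for the three original
    # ASM distance components [4], precomputed for this candidate mapping.
    return (w_s * struct_dist + w_l * label_dist
            + w_d * dir_dist + w_n * node_dist)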
Learning individual distance threshold for each event pattern
In the original design of our system [4], a unified subgraph distance threshold is assigned to all patterns of the same event type. Since the encoded graphs differ across patterns, it is difficult for a single event type-wide threshold to precisely capture the graph variation tolerance of each pattern. We therefore conjecture that an individual threshold would better regulate the subgraph retrieval quality of each pattern and thereby improve event extraction precision.
For patterns encoding lower order events, i.e., events that do not contain nested sub-events, learning a customized threshold is straightforward because their prediction results can be individually assessed. For a given threshold range, we can iteratively search for a threshold leading to the maximum performance of a pattern. The threshold is updated only if the current value results in a larger number of correct event predictions and an equivalent or better prediction precision. To alleviate the potential overfitting problem, a held-out data set is used to validate the candidate threshold before finalizing each update.
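The acceptance test for a candidate threshold (the updateSinglePattern() check in Algorithm 1 below) can be sketched as follows; evaluate() is an assumed helper returning (number of correct event predictions, precision) for one lower order pattern applied with a given distance threshold, and the exact held-out validation criterion shown is an illustrative choice.

def accept_threshold(pattern, candidate, current, evaluate,
                     train_graphs, heldout_graphs):
    cand_correct, cand_prec = evaluate(pattern, candidate, train_graphs)
    curr_correct, curr_prec = evaluate(pattern, current, train_graphs)
    # Accept only more correct predictions with equal or better precision.
    if not (cand_correct > curr_correct and cand_prec >= curr_prec):
        return False
    # Validate on held-out data before finalizing the update, to reduce
    # the risk of overfitting the training sentences.
    ho_cand_correct, ho_cand_prec = evaluate(pattern, candidate, heldout_graphs)
    ho_curr_correct, ho_curr_prec = evaluate(pattern, current, heldout_graphs)
    return ho_cand_correct >= ho_curr_correct and ho_cand_prec >= ho_curr_prec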
The same approach, however, cannot be applied to patterns encoding higher level events as individually measuring their performance is not feasible. Patterns nested with lower order sub-events depend on the corresponding lower order patterns, while patterns cascaded with higher order sub-events rely on all the patterns involved in the downstream, nested structures. Instead of tracing the hierarchical event correlations to evaluate each higher order pattern, we adopted a holistic approach to learn individual thresholds using a genetic algorithm (GA) [26] that automatically determines the values for higher order patterns by evaluating the entire event pattern set.
Our GA works with a population of potential threshold settings. Given a threshold range, the GA simultaneously assigns a candidate threshold value to each higher order pattern. The fitness function of the GA evaluates the performance of the whole pattern set under the current threshold settings. The individually learned thresholds of lower order patterns remain untouched in the GA, and the events produced by them serve as potential arguments that contribute to the functioning of higher level patterns. The GA iteratively applies the fitness function and eventually returns a threshold setting that maximizes the F-score on the training data. Algorithm 1 formalizes our approach for learning individual distance thresholds for event patterns.
When evaluating pattern performance under different threshold settings, graph matching between patterns and sentences is performed only once, with the maximum candidate threshold assigned to all patterns. By maintaining information on event predictions and the corresponding pattern thresholds together, the performance of various threshold settings can be computed efficiently. This is important for the GA, especially when a large number of generations or a large population size is specified.
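The idea can be sketched as below: after the single matching pass, any threshold assignment (including a GA candidate) is scored by filtering the cached predictions rather than re-running graph matching. The tuple layout and the fscore() helper are illustrative assumptions.

def evaluate_thresholds(cached_matches, thresholds, gold, fscore):
    # cached_matches: (pattern_id, match_distance, predicted_event) tuples
    # obtained from one matching pass run with the maximum candidate
    # threshold assigned to every pattern.
    kept = [event for pattern_id, distance, event in cached_matches
            if distance <= thresholds[pattern_id]]
    return fscore(kept, gold)   # also usable as the GA fitness function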
Algorithm 1 Pattern Threshold Learning Algorithm
Input: Dependency graphs of training and held-out sentences G_t and G_h; a finite set of event patterns P = {p_1, p_2, ..., p_i, ...}, composed of the lower order pattern subset P_l and the higher order subset P_h; a predefined threshold value search range V = (v_min, ..., v_i, ..., v_max).
Output: A finite set of thresholds for patterns T = {t_1, t_2, ..., t_i, ...}.
1: for all p_i ∈ P_l do
2:   for all v_i ∈ V do
3:     if updateSinglePattern(p_i, v_i, G_t, G_h) is satisfied then
4:       t_i ← v_i
5: // updateSinglePattern() evaluates the individual performance of p_i with threshold v_i, and
6: // t_i is updated only if v_i results in more correct predictions and an equivalent or better precision
7: T_h ← geneticAlgorithm(P, T_l, G_t, V)
8: // geneticAlgorithm() undergoes procedures of selection, crossover and mutation, and returns an optimized threshold setting T for P by evaluating P as a whole
9: return T
Pattern set optimization by empirical risk minimization algorithm
The original pattern set optimization module [4] measures the prediction precision of patterns and iteratively eliminates patterns whose precision is lower than an empirical threshold. We consider that the optimal event pattern set should satisfy three criteria: (1) a maximum number of matches; (2) the fewest prediction errors; and (3) the least redundancy in patterns. Obviously, these criteria cannot all be met simultaneously. Since the total number of prediction matches by the pattern set is already fixed once the individual threshold of each pattern has been learned, our optimization task becomes one of finding the best balance between criteria (2) and (3).
We implemented the well-known empirical risk minimization (ERM) principle [8, 27] to optimize the event pattern set by balancing prediction errors on training data against a regularization term on the overall redundancy of the pattern set. The objective function of our problem is shown in Eq. (3):

f(P) = E(P, G) + \lambda C_P    (3)
E(P, G) in Eq. (4) models the prediction errors, i.e., both wrongly predicted and missed events, produced by a pattern set P evaluated against the gold annotation G:

E(P, G) = N_{wrong}(P, G) + N_{missed}(P, G)    (4)
C_P accumulates the information redundancy of each p_i ∈ P, measured by the percentage of non-essential nodes in p_i, and λ is a regularization parameter that determines the degree of the penalty on the total redundancy.
Therefore, given an input pattern set P, our optimization problem is to find a pattern set P* ⊆ P which satisfies

P* = \arg\min_{P' \subseteq P} f(P'),

where P' is a subset of P. Clearly, minimizing f(P) prefers compact and effective patterns encoding event arguments in an adjacent context, and penalizes the redundant information in complex patterns.
For our problem, a greedy backward elimination feature selection method is implemented, in which each pattern is evaluated according to its impact on the entire pattern set P, and the one whose removal incurs the largest reduction in f(P) is removed in each iteration. The optimization terminates when f(P) cannot be further reduced. Algorithm 2 shows the detailed procedure.
With λC_P regularizing the optimization, the final set P* may not be the best pattern set in terms of minimizing the prediction errors on the training data, but it has better generalizability on unseen data.
Algorithm 2 ERM-based Pattern Set Optimization Algorithm
Input: A finite set of event patterns P = {p_1, p_2, ..., p_i, ...}, where the distance threshold t_i of each p_i is fixed.
Output: An optimized pattern set P*.
1: P_c ← P // P_c is the current pattern set
2: while P_c is not empty do
3:   compute f(P_c)
4:   maxGain = 0
5:   for all p_i ∈ P_c do
6:     P_t ← P_c − {p_i}
7:     Δf = f(P_c) − f(P_t)
8:     if Δf > maxGain then
9:       maxGain = Δf
10:      p* ← p_i
11:  if maxGain ≤ 0 then
12:    go to Line 14
13:  P_c ← P_c − {p*}
14: P* ← P_c
15: return P*
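A Python rendering of Algorithm 2 under the objective f(P) = E(P, G) + λC_P might look as follows; prediction_errors() and redundancy() are assumed helpers standing in for our error and redundancy computations, and lam is the regularization parameter λ.

def optimize_pattern_set(patterns, gold, prediction_errors, redundancy, lam):
    # prediction_errors() counts wrongly predicted plus missed events for a
    # pattern set against the gold annotation (E(P, G), Eq. 4); redundancy()
    # returns the accumulated percentage of non-essential nodes (C_P).
    def objective(pattern_set):            # f(P) = E(P, G) + lambda * C_P
        return prediction_errors(pattern_set, gold) + lam * redundancy(pattern_set)

    current = set(patterns)
    while current:
        base = objective(current)
        best_gain, best_pattern = 0.0, None
        # Greedy backward elimination: find the pattern whose removal
        # yields the largest reduction in f(P).
        for p in current:
            gain = base - objective(current - {p})
            if gain > best_gain:
                best_gain, best_pattern = gain, p
        if best_pattern is None:           # f(P) cannot be reduced further
            break
        current.remove(best_pattern)
    return current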