To date, most approaches to the BioNLP event extraction task [1, 2] use a single model to produce their output. However, model combination techniques such as voting, stacking, and reranking have been shown to consistently produce higher-performing systems by taking advantage of multiple views of the same data [3–6]. System combination essentially allows systems to regularize each other, smoothing over the artifacts of each (cf. [7, 8]). To the best of our knowledge, the only previous example of model combination for the BioNLP shared task was performed by [1]. Using a weighted voting scheme to combine the outputs of the top six systems, the task organizers obtained a 4% absolute F1 improvement over the best system used in isolation.

In this paper, we explore several model combination strategies, aiming to answer these related questions: Which strategies are effective, and why? How do the predicted event structures change after model combination? Finally, are there systematic errors that can be corrected to improve performance further?

We show that using a straightforward model combination strategy on two competitive systems (*base models*) produces a new system with substantially higher accuracy. This is achieved with the framework of stacking: a *stacking* model uses the output of a *stacked* model as additional features. To put the results in perspective, we also experiment with two simpler model combination techniques where systems are run independently and their outputs are combined via *union* or *intersection*.

Our base models are the UMass [9] and Stanford [10] event extractors. We initially considered combining these models using voting and reranking strategies. However, given the performance gap between the two models, the best option was to feed the predictions of the Stanford system into the UMass system as features (e.g., as in [7]). This has the advantage that one model (UMass) determines how to integrate the outputs of the other model (Stanford) into its own structure. With reranking or voting, the combined model is required to output a structure constructed from the structures produced by the input models; in other words, each portion of the resulting structure originates from at least one of the base models. With stacking, in contrast, the resulting structure can contain novel constructions. However, as it turns out, these novel constructions have low precision in our case. This can be understood using the same intuition that underlies the voting or union strategies: if a structure has been produced by multiple independent models, it is more likely to be correct. Novel events resulting from stacking have essentially been produced by neither base model and thus tend to be inaccurate. We show that by removing these novel events from our output, our state-of-the-art results can be improved further.

### The BioNLP shared task

The BioNLP shared task involves extracting a set of biomolecular events from natural language text in a given document (typically an abstract from a biomedical journal). By biomolecular events, we mean a change of state of one or more biomolecules. More concretely, let us consider part (a) of Figure 1. We see a snippet of text from a biomedical abstract and the three events that can be extracted from it. We will use these to characterize the types of events we ought to extract, as defined by the BioNLP 2009 and 2011 shared tasks. Note that for the shared task, entity mentions (e.g., proteins) are given by the task organizers and hence do not need to be extracted.

The event E1 in the figure refers to a *Phosphorylation* of the TRAF2 protein. It is an instance of a set of *simple events* that describe changes to a single gene or gene product. Other members of this set are: *Gene expression*, *Transcription*, *Localization*, and *Catabolism*. Each of these events has to have exactly one THEME, the protein whose state change is described. A labeled edge in Figure 1a shows that TRAF2 is the THEME of E1.

Event E3 is a *Binding* of TRAF2 and CD40. *Binding* events are special in that they may have more than one THEME, as there can be several biomolecules associated in a binding structure. This is in fact the case for E3.

In the top-center of Figure 1a we see the *Regulation* event E2. Such events describe regulatory or causal relations between events. Other instances of this event type are *Positive Regulation* and *Negative Regulation*. *Regulation* events must have exactly one THEME; this THEME can be a protein or, as in our case, another event. They may also have zero or one CAUSE arguments that denote events or proteins which trigger the *Regulation*.

In the BioNLP shared task, we are also asked to find *anchor* (sometimes called *trigger* or *clue*) tokens for each event. These tokens ground the event in text and allow users to quickly validate extracted events. For example, the anchor for event E2 (a *Regulation* event) is "inhibit," as indicated by a dashed line.

Instead of directly working with the event representation in Figure 1a, both the UMass and Stanford systems extract labeled graphs in the form shown in Figure 1b. The vertices of this graph are the anchor and protein tokens. A labeled edge from an anchor *e* to a protein token *p* with role label *r* indicates that there is an event with anchor *e* for which the protein *p* plays the role *r*. An edge with role *r* from anchor *e* to anchor *e*' means that there is an event at *e*' that plays the role *r* for an event at *e*. This representation is used by the UMass system to define extraction as a compact optimization problem. A related representation is used by the Stanford system to tackle extraction as dependency parsing (see the Stacked Model section for details). If a graph can be drawn on a plane without crossing edges, we say that the graph is *projective* (sometimes referred to as a *planar graph*). Figure 2 shows examples of projective graphs while Figure 1 contains an example of a non-projective graph. We define the *non-projectivity* of a graph as the number of crossing edges in it. For more details about mapping back and forth between events and labeled graphs, we point the reader to [9, 11, 12].
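Since non-projectivity is defined as the number of crossing edges, it can be computed directly by a pairwise check over edge spans. The following is an illustrative helper (ours, not part of either system), where each edge connects two integer token positions:

```python
def non_projectivity(edges):
    """Count crossing edge pairs in a graph whose vertices are token
    positions. Each edge is a (head, dependent) pair of integers."""
    spans = [tuple(sorted(e)) for e in edges]
    crossings = 0
    for i in range(len(spans)):
        for j in range(i + 1, len(spans)):
            (a, b), (c, d) = spans[i], spans[j]
            # Two edges cross iff their spans strictly interleave.
            if a < c < b < d or c < a < d < b:
                crossings += 1
    return crossings

# A graph is projective exactly when this count is zero.
print(non_projectivity([(0, 3), (1, 2)]))  # nested edges: 0
print(non_projectivity([(0, 2), (1, 3)]))  # interleaved edges: 1
```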

The BioNLP 2009 shared task [1] consists of a single domain, Genia (GE) while the BioNLP 2011 shared task [2] expands the Genia domain and adds two additional domains, Epigenetics and Post-translational Modifications (EPI) and Infectious Diseases (ID) ([13–15], respectively). Our experiments in this paper are over the 2011 shared task corpora.

### Model combination approaches

Our primary approach consists of a *stacking model* that uses the predictions of a *stacked model* as features. In the following sections, we briefly present both the stacking and the stacked model and some possible ways of integrating the stacked information. We also describe two simpler model combination techniques (intersection and union) for comparison.

### Stacking model

As our stacking model, we employ the UMass extractor [16]. It is based on a discriminatively trained model that jointly predicts anchor labels, event arguments and protein pairs in bindings. We will briefly describe this model but first introduce three types of binary variables that will represent events in a given sentence. Variables *e*_{*i*,*t*} are active if and only if the token at position *i* has the label *t*. Variables *a*_{*i*,*j*,*r*} are active if and only if there is an event with anchor *i* that has an argument with role *r* grounded at token *j*. In the case of an entity mention, this means that the mention's head is *j*. In the case of an event, *j* is the position of its anchor. Finally, variables *b*_{*p*,*q*} indicate whether or not two entity mentions at *p* and *q* appear as arguments in the same *Binding* event.

Our model consists of two parts: a scoring function and a set of constraints. The scoring function over the anchor variables **e**, argument variables **a**, and *Binding* pair variables **b** is

$$s(\mathbf{e},\mathbf{a},\mathbf{b}) \overset{\text{def}}{=} \sum_{e_{i,t}=1} s_{\text{T}}(i,t) + \sum_{a_{i,j,r}=1} s_{\text{R}}(i,j,r) + \sum_{b_{p,q}=1} s_{\text{B}}(p,q)$$

with local scoring functions $s_{\text{T}}(i,t) \overset{\text{def}}{=} \langle \mathbf{w}_{\text{T}}, \mathbf{f}_{\text{T}}(i,t) \rangle$, $s_{\text{R}}(i,j,r) \overset{\text{def}}{=} \langle \mathbf{w}_{\text{R}}, \mathbf{f}_{\text{R}}(i,j,r) \rangle$ and $s_{\text{B}}(p,q) \overset{\text{def}}{=} \langle \mathbf{w}_{\text{B}}, \mathbf{f}_{\text{B}}(p,q) \rangle$.
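Each local score is thus an inner product of a weight vector with a sparse feature vector. As a minimal illustration (the feature names below are invented, not the system's actual feature set):

```python
def dot(w, f):
    """A local score is the inner product <w, f>, with both the weight
    vector and the sparse feature vector represented as dicts."""
    return sum(w.get(name, 0.0) * value for name, value in f.items())

# Hypothetical anchor features for s_T(i, t) on one token/type pair.
w_T = {"word=phosphorylation&type=Phosphorylation": 2.5,
       "pos=NN&type=Phosphorylation": 0.4}
f_T = {"word=phosphorylation&type=Phosphorylation": 1.0,
       "pos=NN&type=Phosphorylation": 1.0}
print(round(dot(w_T, f_T), 2))  # 2.9
```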

Our model scores all parts of the structure in isolation. It is a joint model due to the nature of the constraints we enforce: first, each active event anchor must have at least one THEME argument; second, only *Regulation* events (or *Catalysis* events for the EPI track) are allowed to have CAUSE arguments; third, any anchor that is itself an argument of another event has to be labeled active, too; finally, if we decide that two entities *p* and *q* are part of the same *Binding* (as indicated by *b*_{*p*,*q*} = 1), there needs to be a *Binding* event at some anchor *i* that has *p* and *q* as arguments. We call structures (**e**, **a**, **b**) that satisfy all of these constraints *valid*.

Stacking with this model is simple: we only need to augment the local feature functions **f**_{T}(*i*, *t*), **f**_{R}(*i*, *j*, *r*) and **f**_{B}(*p*, *q*) to include predictions from the systems to be stacked. For example, for every system *S* to be stacked and every pair of event types (*t*′, *t*_{*S*}) we add the features

$$f_{S,t',t_S}(i,t) = \begin{cases} 1 & \text{if } h_S(i) = t_S \wedge t' = t \\ 0 & \text{otherwise} \end{cases}$$

to **f**_{T}(*i*, *t*). Here *h*_{*S*}(*i*) is the event label given to token *i* according to *S*. These features allow different weights to be given to each possible combination of the type *t*′ that we want to assign and the type *t*_{*S*} that *S* predicts.
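The generation of these indicator features can be sketched as follows (the function and feature-naming scheme are illustrative, not the actual UMass feature extractor):

```python
def stacking_features(i, t, stacked_predictions):
    """Indicator features f_{S,t',t_S}(i, t): one feature per stacked
    system S whose predicted label t_S at token i co-occurs with the
    candidate label t. `stacked_predictions` maps each system name to
    its per-token label assignment h_S (here a plain dict)."""
    features = {}
    for system, h_S in stacked_predictions.items():
        t_S = h_S.get(i)
        if t_S is not None:
            features[f"stacked:{system}:{t}:{t_S}"] = 1.0
    return features

# If the stacked 2N decoder labels token 4 as Binding, the candidate
# label Binding at token 4 receives a matching indicator feature.
print(stacking_features(4, "Binding", {"2N": {4: "Binding"}}))
# {'stacked:2N:Binding:Binding': 1.0}
```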

Inference in this model amounts to maximizing *s*(**e**, **a**, **b**) over the set of structures satisfying the constraints above. Our approach to solving this problem is dual decomposition [17, 18]. This technique exploits the fact that while inference in the full problem may be intractable, the problem usually contains tractable subproblems for which efficient optimization algorithms exist. In dual decomposition, these algorithms are combined in a message passing scheme that often finds the global optimum of the full model. When a global optimum is found, dual decomposition also provides a certificate proving the optimality of the solution.

For our event extraction model we divide the argmax problem into three subproblems: (1) finding the best anchor label and set of outgoing edges for each candidate anchor; (2) finding the best anchor label and set of incoming edges for each candidate anchor; and (3) finding the best pairs of entities to appear in the same *Binding*. For all of these problems, efficient algorithms can be derived [9].
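The message-passing idea can be illustrated on a toy problem in which two subproblems must agree on one label per token (a didactic sketch using subgradient updates, not the actual three-way decomposition used by the system):

```python
def dual_decomposition(scores_a, scores_b, labels, iters=100, step=0.5):
    """Toy subgradient dual decomposition: two subproblems must agree on
    one label per position. scores_a/scores_b: one dict per position
    mapping label -> score."""
    n = len(scores_a)
    lam = [{l: 0.0 for l in labels} for _ in range(n)]  # dual variables
    for t in range(1, iters + 1):
        # Solve each subproblem independently, with penalties folded in.
        ya = [max(labels, key=lambda l, i=i: scores_a[i][l] + lam[i][l])
              for i in range(n)]
        yb = [max(labels, key=lambda l, i=i: scores_b[i][l] - lam[i][l])
              for i in range(n)]
        if ya == yb:
            return ya  # agreement certifies an exact joint optimum
        eta = step / t  # decreasing step size
        for i in range(n):
            if ya[i] != yb[i]:  # subgradient step toward agreement
                lam[i][ya[i]] -= eta
                lam[i][yb[i]] += eta
    return ya  # no agreement reached: fall back to one subproblem

# The penalties pull both subproblems to the jointly best label.
print(dual_decomposition([{"None": 1.0, "Binding": 0.0}],
                         [{"None": 0.0, "Binding": 2.0}],
                         ["None", "Binding"]))  # ['Binding']
```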

For learning the parameters **w** of this model, we employ the online-learner MIRA [19]. MIRA iterates over the training data and compares the gold solution with the current best solution according to **w**. If both solutions disagree, **w** is adapted such that the gold solution would win with sufficient margin if the problem was to be solved again. We refer the reader to [16] for further details on both inference and learning.
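The margin-based update can be sketched as follows (a simplified 1-best variant with illustrative names; see [19] for the full algorithm):

```python
def mira_update(w, f_gold, f_pred, loss):
    """One 1-best MIRA step: if the predicted structure scores too close
    to the gold one, move w by the smallest amount that makes the gold
    structure win by a margin of at least `loss`. Feature vectors and
    weights are sparse dicts."""
    keys = set(f_gold) | set(f_pred)
    delta = {k: f_gold.get(k, 0.0) - f_pred.get(k, 0.0) for k in keys}
    margin = sum(w.get(k, 0.0) * v for k, v in delta.items())
    norm_sq = sum(v * v for v in delta.values())
    if norm_sq == 0.0 or margin >= loss:
        return w  # gold already wins with sufficient margin
    tau = (loss - margin) / norm_sq  # minimal sufficient step size
    new_w = dict(w)
    for k, v in delta.items():
        new_w[k] = new_w.get(k, 0.0) + tau * v
    return new_w

# Starting from zero weights, one update separates gold from prediction.
w = mira_update({}, {"gold_feat": 1.0}, {"pred_feat": 1.0}, loss=1.0)
print(w)  # {'gold_feat': 0.5, 'pred_feat': -0.5}
```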

### Stacked model

For the stacked model, we use a system based on an event parsing framework [10, 20], referred to as the Stanford model in this paper. A high-level description of the system, as relevant to the experiments in this paper, follows. To train the Stanford model, event structures are first projected to dependency trees, in a process similar to that shown in Figure 1b. These dependency trees are rooted, tree-shaped dependency graphs whose nodes are event anchors or entities and whose labeled, directed edges are relations, e.g., THEME and CAUSE. This projection eliminates some of the more complex aspects of event structures which cannot easily be captured in dependency trees, primarily events or entities with multiple parents. Words that do not take part in any events are omitted from the dependency trees, and multiword anchors of events are replaced with their syntactic heads. An example of this conversion can be seen in Figure 2.

After conversion, the dependency trees are parsed using an extension of MSTParser [21, 22] which includes event parsing-specific features. To parse, MSTParser creates a complete graph with entities and event anchors as nodes. For each edge in the complete graph, MSTParser assigns a score using the features along that edge and the feature weights learned from training. At this point, the highest scoring parse (a subgraph of the complete graph which forms a tree) can be *decoded* using several possible algorithms. For example, the algorithm that gives MSTParser its name is the maximum-spanning tree algorithm which searches for a tree that spans all nodes in the graph and obtains the highest sum of edge scores. Once parsed, the resulting dependency tree is converted back to event structures. Training MSTParser involves learning feature weights which separate correct edges from incorrect edges during parsing.

Of particular interest to this paper are the four possible decoders in MSTParser since they result in four different models. These decoders come from combinations of feature order (first or second) and whether the resulting dependency graph is required to be projective. First-order features are features taken from a single edge (including the nodes at each end of the edge) while second-order features include features over two adjacent siblings along with their parent. Non-projective decoders would seem to be useful for this task. In Genia, 20.8% of the documents contain at least one non-projective arc (7.9% of the sentences and 2.9% of the overall dependencies [10]). This portion of the data can only be captured by non-projective decoders.

For brevity, we abbreviate the second-order non-projective decoder as '2N', the first-order projective decoder as '1P', and so on. When referring to Stanford models, we always specify the decoder. Each decoder presents a slightly different view of the data and thus has different model combination properties. Projectivity constraints are not captured in the UMass model, so these decoders incorporate novel information.

Drawing on techniques from statistical constituency parsing [23, 24], we employ a reranking framework to further improve performance and capture global features of event structures. The existing features are restricted to functions of a single edge in the first-order model and of two adjacent siblings in the second-order model. However, some phenomena of event structures span larger structures (e.g., event anchors and all their immediate children, or the number of THEME relations attached to a specific event). To switch to a reranking framework, we extend the decoders to *n*-best decoders, which return the *n* highest scoring parses for each sentence (an *n*-best list) rather than just the single highest scoring parse. Note that our non-projective decoders have only approximate *n*-best decoders (exact inference for the 2N decoder is NP-complete [25]), resulting in suboptimal reranker models in some cases. The reranker rescores each parse in the *n*-best list and returns the highest scoring parse. These scores are based on features of the global event parsing structure as well as metadata about the parse (e.g., MSTParser's parsing score).

The reranker can also be used for model combination when given the output of multiple *n*-best lists. In this case, unique parses are merged, and the number of decoders producing each parse, along with the scores from those decoders, is added to the parse's metadata. While the primary focus of this paper is on using stacking for model combination, a small number of experiments study the performance of using the reranker for model combination.
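The merge-and-rerank step described above can be sketched as follows (`merge_and_rerank` and `rerank_score` are hypothetical stand-ins for the trained reranker, not its actual API):

```python
def merge_and_rerank(nbest_lists, rerank_score):
    """Merge n-best lists from several decoders into unique parses with
    metadata (decoder count and per-decoder scores), then return the
    parse preferred by the rerank scoring function."""
    merged = {}
    for decoder, nbest in nbest_lists.items():
        for parse, score in nbest:
            meta = merged.setdefault(parse, {"decoders": 0, "scores": []})
            meta["decoders"] += 1
            meta["scores"].append((decoder, score))
    return max(merged, key=lambda p: rerank_score(p, merged[p]))

# Toy reranker: prefer parses produced by more decoders,
# breaking ties by the best decoder score.
lists = {"1P": [("parseA", 1.2), ("parseB", 0.9)],
         "2P": [("parseA", 1.0), ("parseC", 0.8)]}
best = merge_and_rerank(
    lists, lambda p, m: (m["decoders"], max(s for _, s in m["scores"])))
print(best)  # parseA (produced by both decoders)
```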

#### Using the Stanford model as a stacked model

The projective Stanford models are helpful in a stacking framework since they capture projectivity which is not directly modeled in the UMass model. Of course, this is also a limitation since actual BioNLP event graphs are DAGs, but the Stanford models perform well considering these restrictions. Additionally, this constraint forces the Stanford model to provide different (and thus more useful for stacking) results.

To produce stacking output from the Stanford system, we need its predictions on the training, development and test sets. For predictions on the test and development sets, we used models learned from the complete training set. Predictions over training data were produced using cross-validation. Obtaining predictions in this way helps to avoid scenarios in which the stacking model learns to rely on high accuracy at training time that cannot be matched at test time.
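This cross-validation scheme can be sketched generically (the `train` and `predict` callables stand in for the Stanford training and decoding steps; the fold assignment here is illustrative):

```python
def out_of_fold_predictions(examples, n_folds, train, predict):
    """Produce stacked-model predictions over the training set via
    cross-validation, so every example is labeled by a model that never
    saw it during training."""
    preds = [None] * len(examples)
    for k in range(n_folds):
        held_out = [i for i in range(len(examples)) if i % n_folds == k]
        rest = [examples[i] for i in range(len(examples)) if i % n_folds != k]
        model = train(rest)  # the model for fold k never sees fold k
        for i in held_out:
            preds[i] = predict(model, examples[i])
    return preds
```

With a dummy `train` that memorizes its training set, one can verify that no example is ever predicted by a model trained on it.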

We used 19 cross-validation training folds for GE, 12 for EPI, and 17 for ID. To produce predictions over the test data, we combined the training folds with 6 development folds for GE, 4 for EPI, and 1 for ID.

Note that, unlike Stanford's individual submission in the BioNLP 2011 shared task [26], the stacked models in this paper do not use the reranker. This is because it would have required making a separate reranker model for each cross-validation fold.

Training the stacking model took about two hours on a 16 core machine. The stacked model needed about three hours on a single core machine for each fold. Since the stacking model and each fold of the stacked model can be trained in parallel, the overall training time is about five hours if sufficient cores are available.

### Intersection and union

We investigate two baseline techniques for model combination: intersection and union. Both are similar to their standard set-theoretic counterparts, except that instead of requiring strict equality between events, we consider two events equal if they match according to the BioNLP approximate recursive scoring metric.
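Combination under approximate equality can be sketched as follows (the `matches` predicate stands in for the BioNLP approximate recursive matcher, which we do not reproduce; the helper is illustrative):

```python
def combine(events_a, events_b, matches, mode="intersection"):
    """Combine two systems' event lists under an approximate equality
    predicate `matches`."""
    if mode == "intersection":
        # Keep A's events that some B event approximately matches.
        return [e for e in events_a if any(matches(e, f) for f in events_b)]
    # Union: all of A, plus B events no A event approximately matches.
    return events_a + [f for f in events_b
                       if not any(matches(e, f) for e in events_a)]

# Toy predicate: case-insensitive identity in place of recursive matching.
matches = lambda a, b: a.lower() == b.lower()
print(combine(["E1", "e2"], ["E2", "E3"], matches))           # ['e2']
print(combine(["E1", "e2"], ["E2", "E3"], matches, "union"))  # ['E1', 'e2', 'E3']
```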