New challenges for text mining: mapping between text and manually curated pathways.

BACKGROUND
Associating literature with pathways poses new challenges to the Text Mining (TM) community. There are three main challenges to this task: (1) the identification of the mapping position of a specific entity or reaction in a given pathway, (2) the recognition of the causal relationships among multiple reactions, and (3) the formulation and implementation of required inferences based on biological domain knowledge.


RESULTS
To address these challenges, we constructed new resources that link text with a model pathway: the GENIA pathway corpus with event annotation, and an NF-kB pathway. Through their detailed analysis, we address an untapped resource, 'bio-inference,' as well as the differences between text and pathway representation. Here, we show precise comparisons of their representations and the nine classes of 'bio-inference' schemes observed in the pathway corpus.


CONCLUSIONS
We believe that the creation of such rich resources and their detailed analysis is a significant first step toward accelerating research on the automatic construction of pathways from text.


Background
Originally created as a graphical depiction of biological knowledge, the pathway has developed into a way of organizing biological knowledge [1,2]. Pathways are becoming increasingly important for bio-medical research, since they represent collectively attested interpretations of a large number of facts scattered throughout literature. As such, Text Mining (TM) tools that facilitate the construction and maintenance of pathway knowledge bases have become indispensable tools for biologists to manage the ever-increasing quantity of biological literature.
From The Second International Symposium on Languages in Biology and Medicine (LBM) 2007, Singapore, 6-7 December 2007.
A few TM systems have been developed for automatic bionetwork construction by extracting binary interactions between proteins or genes [3][4][5][6]. While the resultant networks appear to be pathways, they do not represent any coherent interpretations of the reported facts. Transforming automatically constructed networks into pathways seems to require further efforts that emulate the interpretations of biologists, including inferences based on biological background knowledge.
In this study, we took a very different approach from previous work. We first examined how a biologist would construct a pathway from a given set of articles by recording which sentences in the articles the biologist found useful for the construction of portions of the pathway. Then, we formulated the difficulties that TM techniques should resolve in order to facilitate a manual curation process for pathways. The results show that pathway construction involves much more challenging tasks for current TM technology than we had initially assumed.
The main challenges are classified into three groups: (1) the identification of the mapping position of a specific entity or reaction in a given pathway, (2) the recognition of the causal relationships among multiple reactions, and (3) the formulation and implementation of required inferences based on biological domain knowledge. The inferences that biologists make to associate information in text with a pathway (a concrete representation of an interpretation) seem quite different from deduction. Certain expressions in text trigger inferences, which are abductive in nature, that relate them to specific interpretations. Though this type of inference is pervasive in the process of understanding text and is not specific to the biology domain, we call such inferences "bio-inferences" for the sake of brevity.
In this paper, we report a detailed corpus study, which leads to the formulation of these three challenges. We expect the results will contribute to the design of an intelligent TM tool kit for pathway construction and maintenance. The study has been conducted by adding new annotations to a subset of the GENIA corpus [7,8], which associates sentences with corresponding portions in a pathway, and compares them with event annotations independently made to the GENIA corpus.

PPI network and manually curated pathways
To demonstrate the differences between manually constructed pathways and PPI networks, we take the Toll-like receptor (TLR) pathway as a typical example of a manually constructed pathway [9]. The pathway, which was constructed based on 411 publications and is one of the largest of its kind, consists of 652 nodes and 444 links. We associated each of the 340 protein nodes from among the 652 nodes with a set of accession numbers, and then retrieved a set of biological events reported in MEDLINE on pairs of proteins corresponding to the accession number sets [10].
The set of protein pairs retrieved from MEDLINE can be used as a basis for constructing a PPI network. Although the recognition of protein pairs that appear in events contains errors, we can see significant differences between the PPI network thus constructed and a manually curated pathway. Table 1 shows how single proteins in the extracted pairs correspond to nodes in the manually constructed pathway. Single links in the PPI network, on the other hand, are expanded into paths in the pathway: pairs that are directly linked in the PPI network (where all distances between nodes are 1) appear to be associated through sequences of links in the pathway. Links corresponding to different types of events show distinctly different distributions of the length of the corresponding paths in the pathway. Figure 1 shows the distribution of distances between nodes in the TLR pathway for the binding and positive regulation events.
In order to see the reasons why such significant discrepancies appear between a PPI network and a pathway, we constructed a much smaller pathway, and carefully annotated a corpus based on the GENIA corpus.

GENIA-pathway corpus and an NF-kB pathway
Among the 1,999 Medline abstracts in the GENIA corpus, 561 abstracts are indexed with the MeSH term NF-kB. We call this subset the GENIA NF-kB pathway corpus, or the pathway corpus. Figure 2 shows an NF-kB pathway constructed manually based on the pathway corpus. It consists of 10 reactions in which 15 entities (proteins, genes, and their oligomers) participate. An ID is assigned to each reaction in the pathway, together with a set of evidence sentences for it. The number of evidence sentences is shown in square brackets in Figure 2. Table 2 shows all of the evidence sentences for the reaction R6. Such an association between a reaction and its evidence sentences shows which sentences the specialist judged to be important for the construction of that portion of the pathway. Note that, as seen in Table 2, diverse expressions in text denote the same reaction, R6.
As a subset of the GENIA corpus, every sentence in the pathway corpus is accompanied by manually created event annotations. Figure 3 shows examples of event annotations, with the corresponding reactions in the pathway. This dual annotation of reactions in the pathway and events in text allows us to see not only how events in text are mapped to reactions in the pathway, but also what events or reactions implicit in text have to be inferred by bio-inferences.

Context dependency of the mapping between text and a pathway
The first challenge is the difficulty of identifying the portions of a pathway that are relevant to expressions in text. The same biological entity (protein, molecule, etc.) may appear in more than one place in a pathway, since it can be in different chemical states and thus possess different properties. Unlike mapping proteins to their accession numbers in the traditional task setting [11], mapping named entities in text to nodes in a pathway is highly context dependent. Nodes in a pathway represent instances of a biomolecule, whereas single accession numbers are assigned to biomolecules. Table 3 lists all of the entities in Figure 2. Among these entities, I4 and I5 are instances of the same biomolecule, p50/p65, while I7, I8, and I9 are instances of the biomolecule p50/p65/IkBa.
In text, an entity, that is, an instance of a biomolecule, is not represented as such. It is described by a reference to the corresponding biomolecule (continuant) together with the surrounding textual context. One has to enumerate, from a given textual context, distinct biological states and then locate in these states the biological entities denoted by the biomolecule expression. Let us see the occurrences of NF-kB in the following two sentences:

The active nuclear form of the NF-kappa B transcription factor complex is composed of two DNA binding subunits, NF-kappa B p65 and NF-kappa B p50, … (1493333-S2).

Transcription factor NF-kappa B (p50/p65) is generally localized to the cytoplasm by its inhibitor I kappa B alpha (8319912-S2).
In the first sentence, the biological state in which NF-kB appears is given by the preceding noun phrase "active nuclear form," and we can thus associate the NF-kB reference with the entity I5 in Table 3. On the other hand, since the second sentence indicates that the NF-kB resides in the cytoplasm and interacts with IkBa, the NF-kB in this sentence has to be mapped to I7, which is an instance of the NF-kB/IkBa complex.
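The context dependency illustrated by these two sentences can be sketched in a few lines of Python. This is only a minimal illustration, assuming a hypothetical representation in which each pathway entity is a biomolecule plus a set of state predicates (in the spirit of Table 3). The predicate names, the cue lists, and the function `map_mention` are assumptions made for this sketch and are not part of the GENIA pathway corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    entity_id: str      # node ID in the pathway, e.g. "I5"
    biomolecule: str    # the continuant this node is an instance of
    state: frozenset    # state predicates (names are illustrative assumptions)


# The two instances discussed in the text; predicate names are hypothetical.
ENTITIES = [
    Entity("I5", "p50/p65", frozenset({"in_nucleus", "active"})),
    Entity("I7", "p50/p65/IkBa", frozenset({"in_cytoplasm", "bound_to_IkBa"})),
]

# Hypothetical textual cues for each state predicate (plain substring matching).
CUES = {
    "in_nucleus": ["nuclear"],
    "active": ["active"],
    "in_cytoplasm": ["cytoplasm"],
    "bound_to_IkBa": ["inhibitor i kappa b alpha"],
}


def context_predicates(sentence):
    """Collect the state predicates whose cues occur in the sentence."""
    text = sentence.lower()
    return {pred for pred, cues in CUES.items() if any(c in text for c in cues)}


def map_mention(sentence):
    """Return the ID of the entity whose state best matches the context, or None."""
    observed = context_predicates(sentence)
    best = max(ENTITIES, key=lambda e: len(e.state & observed))
    return best.entity_id if best.state & observed else None


s1 = ("The active nuclear form of the NF-kappa B transcription factor complex "
      "is composed of two DNA binding subunits, NF-kappa B p65 and NF-kappa B p50")
s2 = ("Transcription factor NF-kappa B (p50/p65) is generally localized to the "
      "cytoplasm by its inhibitor I kappa B alpha")
print(map_mention(s1), map_mention(s2))
```

Substring cues of this kind are far too crude for real text (for example, "inactive" contains "active"); the sketch only shows that the same surface name resolves to different nodes once the biological context is taken into account.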
Figure 1: The distribution of distances between the nodes. Pairs of nodes that are directly linked in the PPI network (where all distances between nodes are 1) seem to be associated through sequences of links in the pathway. The links in the PPI network corresponding to the different types of events show distinctly different distribution behaviours in terms of the length of the corresponding paths in the pathway. The distance between nodes in the binding event tends to be close to 1, while the distance between nodes in the positive regulation tends to be distributed widely.
While these two sentences demonstrate that the same named entity expression 'NF-kB' must be mapped to different nodes in a pathway depending on the biological contexts in which it appears, the next example shows that an entity in one sentence must be mapped to different nodes in a pathway:

Stimulation of cells leads to a rapid phosphorylation of I Kappa B alpha, which is presumed to be important for the subsequent degradation. (7499266-S3)
This sentence makes references to two events: phosphorylation (R3), and degradation (R5), which occur in this order. Since a single event contrasts two biological states, the two events introduce three distinct states: before, between and after the two events. While IkBa appears only once in the sentence, it has three different states (un-phosphorylated, phosphorylated, and degraded product), and was thus mapped to three instances, which correspond to the nodes in a pathway. Figure 4 illustrates the difference between representations in text and a pathway for this sentence.
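The counting argument in this paragraph (n ordered events introduce n + 1 states) can be written down directly. The following is a small sketch; the state labels and the helper `expand_states` are hypothetical names chosen for illustration.

```python
def expand_states(initial_state, ordered_events):
    """Return the n + 1 states and the n transitions implied by n ordered events.

    ordered_events is a list of (reaction_id, state_after_reaction) pairs.
    """
    states = [initial_state] + [after for _, after in ordered_events]
    transitions = [
        (states[i], reaction_id, states[i + 1])
        for i, (reaction_id, _) in enumerate(ordered_events)
    ]
    return states, transitions


# The single mention of IkBa in sentence 7499266-S3, expanded by R3 and R5.
states, transitions = expand_states(
    "unphosphorylated",
    [("R3", "phosphorylated"), ("R5", "degraded")],
)
print(states)
print(transitions)
```

Each element of `states` corresponds to a separate node in the pathway, even though the text contains only one occurrence of the name.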
As exemplified here, for pathway-text association, we need to (1) recognize biomolecule references from text (recognition of biomolecules), (2) capture biological contexts from the textual context (enumeration of biological contexts), and (3) find corresponding entities in a pathway (identification of instances in biological contexts). While step (1) is the classic Named Entity Recognition (NER) problem, steps (2) and (3) are newly introduced obstacles for the pathway-text association task.

Integration of scattered information into a pathway
Conversely, in some cases, entities in different texts may have to be mapped to the same node in a pathway. Table 2 shows all of the evidence sentences from the GENIA event corpus associated with the reaction R6. While at a glance these evidence sentences appear to be different, they all refer to the same event: the translocation of NF-kB from the cytosol to the nucleus. In order to integrate information scattered over different articles into a pathway, these diverse surface expressions should be mapped to a single link (R6) in the pathway. All the typographic variants of NF-kappa B as well as its synonyms (p50-p60, etc., shown in bold) have to be recognized. Furthermore, the Localization event is described by using diverse predicates (translocate, take up, migrate and move) in diverse syntactic constructions (underlined). Simple information extraction techniques based on co-occurrences cannot deliver such fine-tuned recognition of events, and structure-based IE would be indispensable [6,12,13].
Table 2: Textual evidence associated with the reaction R6 of the GENIA version NF-kB pathway. The sentences are all extracted from the GENIA corpus and stored as the GENIA pathway corpus. The PMID and SID (sentence ID) of the source of the textual evidence are given in the first and second columns, respectively. The reaction R6 represents the localization of NF-kB from the cytoplasm to the nucleus. The text expressions referring to NF-kB, the localization event, and the location information are highlighted in bold, underlined, and italicized, respectively.
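The normalization part of this task, that is, collapsing name variants and predicate variants onto one reaction, can be sketched as follows. The regular expression, the lookup tables, and the function names are illustrative assumptions. The sketch deliberately takes the entity, predicate, and destination as already-identified arguments, since deciding which phrase fills which role is exactly what the text says requires structure-based IE. Synonyms such as the subunit names are not handled.

```python
import re

# Typographic variants of the entity name (an illustrative pattern, not exhaustive).
ENTITY_VARIANTS = re.compile(r"\bNF[-\s]?(?:kappa\s?B|kB)\b", re.IGNORECASE)

# Predicate lemmas listed in the text for the Localization event.
PREDICATE_TO_EVENT = {
    "translocate": "Localization",
    "take up": "Localization",
    "migrate": "Localization",
    "move": "Localization",
}

# Hypothetical index from a normalized (entity, event type, destination) to a reaction.
REACTIONS = {("NF-kB", "Localization", "nucleus"): "R6"}


def normalize_entities(text):
    """Rewrite every recognized variant to the canonical name 'NF-kB'."""
    return ENTITY_VARIANTS.sub("NF-kB", text)


def to_reaction(entity_text, predicate_lemma, destination):
    """Map already-extracted event arguments to a reaction ID, or None."""
    key = (
        normalize_entities(entity_text),
        PREDICATE_TO_EVENT.get(predicate_lemma.lower()),
        destination.lower(),
    )
    return REACTIONS.get(key)


print(normalize_entities("NF-kappa B, NF-kappaB and NF-kB"))
print(to_reaction("NF-kappa B", "migrate", "nucleus"))
```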

New type of disambiguation problem
Similarly, the correspondence between events reported in different papers and reactions in a pathway is not straightforward. Consider Figure 5, which shows a detailed interpretation of the reaction sequence R3~R6. The main concern of the pathway is the state change of NF-kB from the inactive to the active form (Figure 5(a)). The whole process is broken down into two sub-processes: the dissociation of NF-kB from IkBa, and the localization of NF-kB into the nucleus (Figure 5(b)). The dissociation process is described in more detail by the IkBa degradation process (Figure 5(c)). These three sequences constitute the whole
Figure 2: NF-kB pathway GENIA version. The NF-kB pathway GENIA version is comprised of 16 entities, which are organized as 10 reactions. A brief explanation of the pathway is as follows: p65 binds to p50 to form a complex NF-kB (R1). NF-kB binds to IkBa in the cytosol (R2) that is rapidly phosphorylated on serine 32 and 36 (R3), ubiquitinated (R4), and degraded by 26S proteasome (R5). Freed NF-kB translocates to the nucleus (R6) and initiates the transcription of genes by binding to the kB site in their promoter region (R7). Because either IKBA or NFKB1 has a kB site in its promoter, NF-kB transcribes it to produce protein IkBa (R8) or p105 (R9), respectively. p105 is a precursor of p50 and is constitutively processed to produce p50 (R10). The number of evidence sentences supporting each reaction is shown in parentheses next to the ID of the reaction. P: phosphorylated, S: serine, Ub: ubiquitinated.
Since the first sentence focuses on external factors (NAC, IL-4 and anti-CD40 mAb) which cause the state change of NF-kB from inactive to active, NF-kB activation in this sentence refers to the entire NF-kB activation process in Figure 5(a).
In contrast, the second sentence discusses the process at the micro-level, and the NF-kB activation refers only to the localization portion in Figure 5(b), which is the moment at which the activation is accomplished. Thus, all aspects of information discussed in different articles must be integrated into a pathway. The pathway (Figure 2) is a result of such an effort by a biologist who has constructed the model pathway.
In Figure 5, the nodes in the same biological contexts are vertically aligned. This alignment assumes the basic knowledge of the biologist including: "the inactive NF-kB is equated with the NF-kB, which is bound to IkB in the cytoplasm," and inferences such as, "the presence of the NF-kB which is bound to IkB in the cytoplasm also implies the presence of IkBa itself, which is bound to NF-kB in the cytoplasm" (symmetry). The three leftmost nodes aligned vertically in Figure 5 (the inactive NF-kB and the NF-kB and IkBa, which are bound to each other) are all equivalent and correspond to the entity I7 in Table 3.
Figure 3: Screenshot of one example from GENIA event annotation and corresponding reaction IDs. One example from the GENIA pathway corpus is shown. Each sentence in the GENIA pathway corpus comes with event annotations and is mapped to the corresponding reactions in the NF-kB pathway. Event IDs E13, E14, E15 are the adopted events to the NF-kB pathway. Their mapped reaction IDs are also shown by the orange arrow boxes.

Constraints on sequences of multiple reactions
The second difficulty concerns the recognition of mutual relationships among multiple reactions. While the research on Information Extraction (IE) in the bio-medical domain has focused on the extraction of events or relations independently of the contexts in which they appear [14][15][16], pathway construction requires extracting their mutual relationships in the form of causal chains [17][18][19].
Since exact precedence relationships among events are not usually expressed explicitly, we can only extract constraints on causal chains. At the integration stage, we construct a pathway which satisfies all constraints gathered from the whole set of articles.
Explicit causal expressions in text such as "A causes B," "A leads to B," and "A activates B" indicate the constraint that "A precedes B". Note that these expressions do not necessarily imply direct causation: intervening reactions commonly exist in the pathway between A and B. Moreover, human biologists infer much richer constraints on sequences of reactions from expressions in text other than explicit causal expressions. Consider the following sentence:

A fraction of the phosphorylated form of I kappa B alpha remains physically associated with the NF-kappa B complex in vivo but is subject to rapid degradation, thereby promoting the nuclear translocation of the active NF-kappa B complex.
There are four events mentioned in this sentence: (1) the binding of NF-kappa B complex and I kappa B alpha (R2), (2) the phosphorylation of I kappa B alpha (R3), (3) the degradation of I kappa B alpha (R5), and (4) the translocation of NF-kappa B complex (R6). More importantly, from this sentence, a biologist can extract relational constraints among these four events.
From the verb 'remains', the biologist understands that the event in the subject (R3) does not disrupt the event in the objective complement (R2), and may thereby infer a constraint on the event sequence: R2 precedes R3. Another constraint is inferred from 'is subject to.' It indicates that the state after the reaction in the subject (R3) is readily taken up by another reaction (R5) in the object. Therefore, the constraint that R3 precedes R5 is inferred. Lastly, 'thereby' shows that the event in the foregoing clause (R5) is followed by the event in the subsequent clause (R6). By putting these constraints together, the biologist successfully constructed the portion of the pathway in Figure 2 that covers these events.
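Once such precedence constraints have been gathered, finding an ordering of reactions that satisfies all of them is a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) on the three constraints derived above; the function name `order_reactions` is a hypothetical helper, and treating the constraints as a simple directed graph is an assumption of this sketch, not a description of how the pathway was actually built.

```python
from graphlib import CycleError, TopologicalSorter

# (earlier, later) pairs inferred from the example sentence.
CONSTRAINTS = [
    ("R2", "R3"),  # from 'remains'
    ("R3", "R5"),  # from 'is subject to'
    ("R5", "R6"),  # from 'thereby'
]


def order_reactions(constraints):
    """Return one reaction order consistent with all 'earlier precedes later' pairs.

    Raises graphlib.CycleError if the constraints contradict each other.
    """
    sorter = TopologicalSorter()
    for earlier, later in constraints:
        sorter.add(later, earlier)  # 'earlier' is a predecessor of 'later'
    return list(sorter.static_order())


print(order_reactions(CONSTRAINTS))
```

Because 'precedes' does not imply direct causation, the returned order only fixes relative positions; reactions such as R4 can still be placed between R3 and R5 when constraints from other articles are added. A `CycleError` signals that constraints collected from different sentences are mutually inconsistent.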

Bio-inferences
The third difficulty comes from the use of biological domain knowledge. In general, at the stage of information integration in a pathway, biologists extensively use their background knowledge as well as the pathway at hand to infer pieces of information, which are implicit in text. We have to understand what cues in text trigger bio-inferences and what facts are to be inferred for pathway construction.
To elucidate the inference process, we classified the evidence sentences according to whether they contain direct expressions of the identified reactions or not (Table 4). Evidence sentences are classified as direct when they contain annotated events in the GENIA event annotation. The events in the GENIA corpus were independently annotated by a group of annotators. The sentences judged as indirect do not contain such corresponding annotated events but nonetheless were identified as evidence sentences. Table 4 shows that evidence sentences were distributed evenly between direct and indirect ones. In the following, we identify what constitutes indirect cues and discuss what types of bio-inferences are triggered by them. The analysis reveals that highly domain-dependent cues trigger bio-inferences.
Table 3: The second column shows the name of each entity given by the biologist who created the pathway model. The biomolecules corresponding to each entity are given in the third column. The last column gives the state of each entity, which is represented as a set of predicates.

Classification of bio-inference for pathway construction from text
We analysed evidence sentences regarded as indirect to see whether specific textual cues exist to trigger the inference, and classified them according to their relationships with the inferred reactions. The analysis resulted in six major inference schemes, shown in Table 5. An example sentence for each scheme is shown in Table 6.

1a/b. State(s) of entity(-ies) before or after reaction
From the existence of a certain entity (cue) the biologist inferred the associated reaction. Example cues of this class are precursor (1a, ex1) and heterodimer (1b, ex2) (shown in boldface in Table 6). In general, biologists know that a precursor should be processed to the mature protein and that a heterodimer represents two distinct proteins coming together. These cues trigger inference to assume a processing and binding event, respectively.

2a/b. Function(s) of entity(-ies) before or after reaction
The next class is the inference scheme based on the existence of an entity (or entities) with a specific function. The example cues of this class are 'high-affinity binding site' (2a, ex3) and 'retaining it in the cytoplasm' (2b, ex4). These indicate that a protein mentioned in the sentence has a specific function. A protein that has a high-affinity binding site for another protein will, in all likelihood, bind to it, and a protein whose function is to retain another protein cannot exercise that function without binding to it. The evidence sentences containing these two cues appear for R2.
Figure 4: Event representation in NL expression vs. pathway representation.

3a/b. Influence of state or functional change(s) of entity(-ies)
This is the class of inferring a latent reaction from the influence caused by a state or functional change. Examples of cue phrases of this type are 'serine, mutated' (3a, ex5) and 'proteasome inhibitor' (3b, ex6). Due to the state change caused by the serine mutation, the protein was not phosphorylated. Conversely, biologists recognize the reaction in which the protein is phosphorylated on serine if it is not mutated (R3). Treatment with a proteasome inhibitor changes the function of the proteasome, thereby inhibiting the degradation of I kappa B alpha (R5). Biologists also know that I kappa B alpha is degraded by the proteasome if the proteasome is not inhibited. Overall, this class of bio-inference associates a cue with the reaction that normally occurs when something (a state or a function) is not changed.
Table 4: The evidence sentences for the pathway representation of each reaction, divided into direct and indirect expressions. The direct expressions were annotated with the 'expected annotation class' in the GENIA event annotation, while the indirect expressions were not. The latter need some 'inference' based on biological domain knowledge to construct the corresponding pathway.
Figure 5: NF-kB activation process from different perspectives. A fragment (R3~R6) of the NF-kB lifecycle pathway in Figure 2 represents the process of NF-kB activation. (a), (b), and (c) illustrate a breakdown of the process from different perspectives: (a) from the perspective of the functional state of the NF-kB, (b) from the perspective of the molecular state of the NF-kB, and (c) from the perspective of the molecular state of the IkBa. Although (c) does not represent NF-kB directly, the state change of IkBa can be interpreted as a part of the NF-kB activation process. It is reflected in the entities of the NF-kB/IkBa complex in the NF-kB pathway in Figure 2, which correspond to I7, I8 and I9 in Table 3.

4. Related reaction
This class of cues is based on the strong association between two reactions: a reaction unmentioned in text is inferred from an explicit reaction. An example of a linguistic cue phrase of this class is 'As observed with' (Table 6, ex7). The example does not state that any reaction directly involved I kappa B alpha. However, from the expression 'As observed with,' together with the biological knowledge that RelA is a transcription factor and that I kappa B alpha is a protein produced from its gene through transcription and translation driven by a transcription factor, biologists recognize the reaction in which RelA transcribes the mRNA of the IkBa gene and the IkBa protein is induced, that is, the gene expression event on IkBa by RelA (R8). Thus, the linguistic cue of this class appears in the sentence structure.

5. Reverse reaction
This class is the inference from the reverse reaction. The example, ex8, describes the dissociation event that occurred between 'NF-kappa B' and 'I kappa B alpha.' The dissociation must be preceded by a binding event (R2), because all proteins are produced as a single molecule. However, the reverse reaction does not always occur; whether the reverse reaction exists or not depends on the domain knowledge.

Table 6: The example sentences of each class of 'bio-inference' scheme are shown with their inferred reaction IDs. The linguistic cues to infer some reaction are shown in boldface.
Class | Example | Sentence | Inferred reaction
1a | ex1 | p50 is translated as a precursor of 105 kDa. | R10
1b | ex2 | CD23-induced NF-kappaB is a heterodimer composed of p65/p50 subunits. | R1
2a | ex3 | RelA contains a high-affinity binding site for its cytoplasmic inhibitor, I kappa B alpha. | R2
2b | ex4 | I kappa B-alpha inhibits transcription factor NF-kappa B by retaining it in the cytoplasm. | R2
3a | ex5 | When either serine-32 or serine-36 of I kappa B-alpha was mutated, the protein did not undergo signal-induced phosphorylation. | R3
3b | ex6 | Pretreatment of the cells with the proteasome inhibitor N-Ac-Leu-Leu-norleucinal inhibits this ligand-induced degradation of the human I kappa B alpha protein. | R5
4 | ex7 | As observed with I kappa B alpha, nuclear RelA stimulates p100 mRNA and protein expression. | R8
5 | ex8 | Nuclear expression of NF-kappa B occurs after its induced dissociation from its cytoplasmic inhibitor I kappa B alpha. | R2
6 | ex9 | A failure to degrade IkappaB-alpha pretreated with PAO is not due to its inhibitory effect on proteasomal degradation. | R5

6. Characteristic of reaction
The last class is the inference scheme based on a qualifier that shows a characteristic of a reaction. The example, ex9, mentions the degradation event on 'IkappaB-alpha' directly (R5), but does not indicate what caused the reaction or how it proceeded. Biologists will pay attention to the cue qualifier 'proteasomal.' They know, as biological domain knowledge, that its substantive form 'proteasome' is the name of a huge protein complex that causes protein degradation. Thus, from the qualifier 'proteasomal,' which shows a characteristic of the reaction, they can infer that the modifier of the degradation event on IkBa should be the proteasome.
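The classification above can be summarized as a lookup from textual cues to inference classes and inferred reactions. The sketch below encodes only the nine example cues and reaction IDs given in Table 6; the table layout, the substring matching, and the function `candidate_inferences` are assumptions made for illustration. Real cues are highly domain dependent, and the mere presence of a cue does not license the inference without the biological knowledge discussed in each class.

```python
# (class label, cue fragments that must all occur, inferred reaction),
# following the examples ex1 to ex9 in Table 6.
SCHEMES = [
    ("1a", ("precursor",), "R10"),
    ("1b", ("heterodimer",), "R1"),
    ("2a", ("high-affinity binding site",), "R2"),
    ("2b", ("retaining it in the cytoplasm",), "R2"),
    ("3a", ("serine", "mutated"), "R3"),
    ("3b", ("proteasome inhibitor",), "R5"),
    ("4", ("as observed with",), "R8"),
    ("5", ("dissociation",), "R2"),
    ("6", ("proteasomal",), "R5"),
]


def candidate_inferences(sentence):
    """Return (class, reaction) pairs whose cue fragments all occur in the sentence."""
    text = sentence.lower()
    return [
        (label, reaction)
        for label, fragments, reaction in SCHEMES
        if all(fragment in text for fragment in fragments)
    ]


ex1 = "p50 is translated as a precursor of 105 kDa."
ex5 = ("When either serine-32 or serine-36 of I kappa B-alpha was mutated, "
       "the protein did not undergo signal-induced phosphorylation.")
ex9 = ("A failure to degrade IkappaB-alpha pretreated with PAO is not due to "
       "its inhibitory effect on proteasomal degradation.")
for sentence in (ex1, ex5, ex9):
    print(candidate_inferences(sentence))
```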

Discussion
Given the complexity and size of the pathways that are being, and will be, constructed, TM tools that facilitate their construction and maintenance are becoming crucial for bio-medical research. At the same time, text retrieval based on pathways will become an effective means by which biologists may gain access to articles relevant to their interests. For such TM scenarios to materialize, the technical challenges involved in associating text with pathways have to be properly understood and formulated.
In this paper, we identified three challenges. The first one, identification of the mapping position of a specific entity in a pathway, goes beyond the traditional challenge tasks of accession number assignment for proteins [11] or protein-protein interactions (PPI) [16]. The comparison between a PPI network and a pathway shows that the same biomolecules appear in several places in a pathway, thus making it necessary for us to associate biomolecular expressions in text with one of their occurrences in a pathway. The correspondence is highly context dependent.
The second challenge is the recognition of constraints on sequences of multiple reactions. While temporal sequences of events have been studied in the general domain, the techniques developed there are mostly concerned with the treatment of temporal expressions [17]. In contrast, as exemplified in this paper, we have to recognize causal relationships among events, which requires inferences based on domain knowledge as well as subtle cues in text, instead of explicit temporal expressions.
The third and the most difficult of the challenges is the problem involved in formulating and implementing bio-inferences. Half of all crucial evidence sentences do not contain an explicit description of events. The specialist inferred implicit events, and the inferences seem to be triggered by specific fragments of expressions in text. While logical or deductive inferences based on the domain knowledge certainly play a role in the process [20], the inferences here are more akin to associative and plausible inferences [21]. How to formulate and exploit such inferences in TM would be the greatest challenge of all.
The analysis given in this paper has been made on the pathway corpus, which has been built based on the GENIA corpus. Since the GENIA pathway corpus is a highly biased corpus [22], and its size is currently not large enough, we are now working on scaling up the pathway corpus by using full papers, and extending the corpus beyond the GENIA set. Figure 6 shows a fuller version of the NF-kB pathway being constructed from a collection of full papers from the whole of Medline. To enlarge the pathway corpus as a rich resource, we will append annotations to the collected sentences indicating whether the text is relevant, which reaction it indicates, whether the expression is 'direct' or 'indirect,' and its inference scheme in cases where the expression is 'indirect.' The stored data will also be annotated by the GENIA method of event annotation. In the future, we will work on other kinds of pathways, such as the epidermal growth factor receptor signalling pathway, thus establishing a consistent method of pathway annotation for the automatic building of a pathway from text.

Conclusions
In this study, we presented two new resources: the pathway corpus and its corresponding NF-kB pathway, whose mapping between pathway and text is compared with annotated events. Based on a detailed corpus study, we formulated three challenges for TM tools. We also revealed that inferences triggered by highly domain-dependent cues play a central role in recovering events that are implicit in text but crucial for text-pathway association. We believe the corpus-based study presented in this paper is an important first step toward new TM technology for pathway construction.

Construction of a PPI network and its mapping to a pathway
The PPI network for the proteins in the TLR pathway is constructed by using an IE system [23]. The IE system uses a full parser [24] to reveal the semantic structures of sentences, and then applies simple rules to identify event types. The event type recognition is based on linguistic clues developed by our previous work [8]. A dictionary-based protein name recognizer is used to map the protein names to a set of accession numbers [10].
The distribution of path lengths for the binding and positive regulation events (Figure 1) is calculated after KO identifies the corresponding nodes in the pathway for each protein in the sentences in which the event is recognized. Moreover, the distribution of path lengths for pairs in all extracted events is calculated by assuming that the two nodes which give the shortest path are the nodes corresponding to the two proteins of a given pair. Therefore, the lengths tend to be estimated to be shorter than they actually are.
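The shortest-path assumption for pairs in all extracted events can be sketched as a breadth-first search over the pathway graph, minimized over all candidate nodes of each protein. The adjacency-list representation, the treatment of links as undirected, the toy graph, and the function names are assumptions made for this sketch; the text does not specify how the computation was implemented.

```python
from collections import deque


def bfs_distances(graph, source):
    """Return the number of links from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def shortest_pair_distance(graph, candidates_a, candidates_b):
    """Minimum path length over all candidate node pairs, or None if unconnected.

    Taking the minimum is why the estimated lengths are biased toward short paths.
    """
    best = None
    for a in candidates_a:
        dist = bfs_distances(graph, a)
        for b in candidates_b:
            if b in dist and (best is None or dist[b] < best):
                best = dist[b]
    return best


# Toy pathway: a chain n1 - n2 - n3 - n4 (links treated as undirected here).
toy = {"n1": ["n2"], "n2": ["n1", "n3"], "n3": ["n2", "n4"], "n4": ["n3"]}

# Protein A occurs at node n1; protein B occurs at both n3 and n4.
print(shortest_pair_distance(toy, {"n1"}, {"n3", "n4"}))
```

In the toy example the true node for protein B might be n4 (distance 3), but the minimum over candidates reports 2, which illustrates the underestimation noted above.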

Event annotation and the pathway corpus
The GENIA event annotation was made on half of the GENIA corpus [8,25], and consists of 1,000 Medline abstracts. It contains 9,372 sentences in which 36,114 events are identified. The annotation was carried out over the course of two years by a group of annotators (four to six biologists) and one coordinator (co-author of the paper, TO). In order to avoid inter-annotator discrepancy, the annotators were instructed not to make free inferences.