Skip to main content

New challenges for text mining: mapping between text and manually curated pathways



Associating literature with pathways poses new challenges to the Text Mining (TM) community. There are three main challenges to this task: (1) the identification of the mapping position of a specific entity or reaction in a given pathway, (2) the recognition of the causal relationships among multiple reactions, and (3) the formulation and implementation of required inferences based on biological domain knowledge.


To address these challenges, we constructed new resources to link the text with a model pathway; they are: the GENIA pathway corpus with event annotation and NF-kB pathway. Through their detailed analysis, we address the untapped resource, ‘bio-inference,’ as well as the differences between text and pathway representation. Here, we show the precise comparisons of their representations and the nine classes of ‘bio-inference’ schemes observed in the pathway corpus.


We believe that the creation of such rich resources and their detailed analysis is the significant first step for accelerating the research of the automatic construction of pathway from text.


Originally created as a graphical depiction of biological knowledge, the pathway has developed into a way of organizing biological knowledge [1, 2]. Pathways are becoming increasingly important for bio-medical research, since they represent collectively attested interpretations of a large number of facts scattered throughout literature. As such, Text Mining (TM) tools that facilitate the construction and maintenance of pathway knowledge bases have become indispensable tools for biologists to manage the ever-increasing quantity of biological literature.

A few TM systems have been developed for automatic bio-network construction by extracting binary interactions between proteins or genes [36]. While the resultant networks appear to be pathways, they do not represent any coherent interpretations of the reported facts. To transform the results of automatically constructed networks to pathways seems to require further efforts which emulate the interpretations of biologists, including inferences based on biological background knowledge.

In this study, we took on a very different approach from previous works. We first examined how a biologist would construct a pathway from a given set of articles by recording which sentences in the articles the biologist found useful for the construction of the portions of a pathway. Then, we formulated what difficulties TM techniques should resolve in order to facilitate a manual curation process for pathways. The results show that pathway construction involves much more challenging tasks for the current TM technology than we had initially assumed.

The main challenges are classified into three groups: (1) the identification of the mapping position of a specific entity or reaction in a given pathway, (2) the recognition of the causal relationships among multiple reactions, and (3) the formulation and implementation of required inferences based on biological domain knowledge. Inferences that biologists make to associate information in text with pathway (concrete representation of interpretation) seem quite different from deduction. Certain expressions in text trigger inferences, which are abductive in nature, to relate them with specific interpretations. Though this type of inference is pervasive in the process of understanding text and unspecific to the biology domain, we call them “bio-inferences” for the sake of brevity.

In this paper, we report a detailed corpus study, which leads to the formulation of these three challenges. We expect the results will contribute to the design of an intelligent TM tool kit for pathway construction and maintenance. The study has been conducted by adding new annotations to a subset of the GENIA corpus [7, 8], which associates sentences with corresponding portions in a pathway, and compares them with event annotations independently made to the GENIA corpus.


PPI network and manually curated pathways

To demonstrate the differences between manually constructed pathways and PPI networks, we take the Toll-like receptor (TLR) pathway as a typical example of a manually constructed pathway [9]. The pathway, which was constructed based on 411 publications and is one of the largest of its kind, consists of 652 nodes and 444 links. We associated each of the 340 protein nodes from among the 652 nodes with a set of accession numbers, and then retrieved a set of biological events reported in MEDLINE on pairs of proteins corresponding to the accession number sets [10].

The set of retrieved pairs of proteins from MEDLINE can be used as basis for constructing a PPI network. Although recognition of protein pairs which appear in events contains errors, we can see significant differences of a PPI network to be constructed and a manually curated pathway.

Table 1 shows how single proteins in the extracted pairs of proteins correspond to nodes in the manually constructed pathway. On the other hand, single links in the PPI network are expanded into paths in the pathway. That is, sequences of links appear to associate the pairs which are directly linked in the extracted pairs (all distances between nodes in a PPI network are 1). The links in the pairs corresponding to different types of events show distinctly different distribution behaviours in terms of the length of the corresponding paths in the pathway. Figure 1 shows the distribution of distances between nodes in the TLR pathway of the binding and positive regulation events.

Table 1 Distribution of one protein in the PPI to multiple nodes in the pathway.
Figure 1
figure 1

The distribution of distances between the nodes. The pair of nodes which is directly linked in the PPI network (all distances between the nodes are 1) seem to be associated through a sequences of links. The links in the PPI network corresponding to the different types of events show distinctly different distribution behaviours in terms of the length of corresponding paths in the pathway. The distance between nodes in the binding event tends to be close to 1, while the distance between nodes in the positive regulation tends to be distributed widely.

In order to see the reasons why such significant discrepancies appear between a PPI network and a pathway, we constructed a much smaller pathway, and carefully annotated a corpus based on the GENIA corpus.

GENIA-pathway corpus and a NF-kB pathway

Among 1999 Medline abstracts in the GENIA corpus, 561 abstracts are indexed with the Mesh term, NF-kB. We call this subset the GENIA NF-kB pathway corpus, or the pathway corpus. Figure 2 shows a NF-kB pathway constructed manually based on the pathway corpus. It consists of 10 reactions in which 15 entities (proteins, genes, and their oligomers) participate. An ID is assigned to each reaction in the pathway with a set of evidence sentences for it. The number of evidence sentences is shown in the square bracket in Figure 2. Table 2 shows all of the evidence sentences for the reaction R6. Such an association between a reaction and its evidence sentences illustrates which sentences are judged by the specialist to be important for the construction of the portion of the pathway. Note that, as seen in Table 2, diverse expressions in text denote the same reaction, R6.

Table 2 Textual evidence for the reaction, R6 of the NF-kB pathway
Figure 2
figure 2

NF- k B pathway GENIA version. NF-kB pathway GENIA version is comprised of 16 entities, which are organized as 10 reactions. A brief explanation of the pathway is as follows: p65 binds to p50 to form a complex NF-kB (R1). NF-kB binds to IkBa in the cytosol (R2) that is rapidly phosphorylated on serine 32 and 36 (R3), ubiquitinated (R4), and degraded by 26S proteasome (R5). Freed NF-kB translocates to the nucleus (R6) and initiates the transcription of genes by binding to the kB site in their promoter region (R7). Because either IKBA or NFKB1 has a kB site in its promoter, NF-kB transcribes it to produce protein IkBa (R8) or p105 (R9), respectively. p105 is a precursor of p50 and constitutively processed to produce p50 (R10). The number of evidence sentence supporting the reaction is shown in parenthesis next to the ID of a reaction. P: phosphorylated, S: serine, Ub: ubiquitinated.

As a subset of the GENIA corpus, every sentence in the pathway corpus is accompanied by manually created event annotations. Figure 3 shows examples of event annotations, with the corresponding reactions in the pathway. This dual annotation of reactions in the pathway and events in text allows us to see not only how events in text are mapped to reactions in the pathway, but also what events or reactions implicit in text have to be inferred by bio-inferences.

Figure 3
figure 3

Screenshot of one example from GENIA event annotation and corresponding reaction IDs. One example from the GENIA pathway corpus is shown. Each sentence in the GENIA pathway corpus comes with event annotations and is mapped to the corresponding reactions in the NF-kB pathway. Event IDs E13, E14, E15 are the adopted events to the NF-kB pathway. Their mapped reaction IDs are also shown by the orange arrow boxes.

Context dependency of the mapping between text and a pathway

The first presented challenge is coping with the difficulty of identifying portions in a pathway relevant to expressions in text. The same biological entity (protein, molecule, etc.) may appear in more than one place in a pathway, since it is in different chemical states and thus possesses different properties. Unlike mapping proteins to their accession numbers in the traditional task setting [11], mapping named entities in text to nodes in a pathway is highly context dependent. Nodes in a pathway represent instances of a biomolecule, and single accession numbers are assigned to biomolecules. Table 3 depicts all of the entities in Figure 2. Among these entities, I4 and I5 are instances of the same biomolecule, the p50/p65, while I7, I8, and I9 are instances of the biomolecule, p50/p65/IkBa.

Table 3 Entities in the NF-kB pathway, the GENIA version

In text, an entity, that is, an instance of a biomolecule, is not represented as such. It is described by a reference to the corresponding bio-molecule (continuant) together with the surrounding textual context. One has to enumerate, from a given textual context, distinct biological states and then locate in these states the biological entities denoted by the biomolecule expression. Let us see the occurrences of NK-kB in the following two sentences:

The active nuclear form of the NF-kappa B transcription factor complex is composed of two DNA binding subunits, NF-kappa B p65 and NF-kappa B p50, … (1493333-S2).

Transcription factor NF-kappa B (p50/p65) is generally localized to the cytoplasm by its inhibitor I kappa B alpha (8319912-S2).

In the first sentence, the biological state in which NF-kB appears, is given by the preceding noun phrase “active nuclear form,” and we can thus associate the NF-kB reference with the entity I5 in Table 3. On the other hand, since the second sentence indicates that the NF-kB resides in the cytoplasm and interacts with IkBa, the NF-kB in this sentence has to be mapped to I7, which is an instance of the NF-kB/IkBa complex.

While these two sentences demonstrate that the same named entity expression ‘NF-kB’ must be mapped to different nodes in a pathway depending on the biological contexts in which they appear, the next example shows that an entity in one sentence must be mapped to different nodes in a pathway;

Stimulation of cells leads to a rapid phosphorylation of I Kappa B alpha, which is presumed to be important for the subsequent degradation. (7499266-S3)

This sentence makes references to two events: phosphorylation (R3), and degradation (R5), which occur in this order. Since a single event contrasts two biological states, the two events introduce three distinct states: before, between and after the two events. While IkBa appears only once in the sentence, it has three different states (un-phosphorylated, phosphorylated, and degraded product), and was thus mapped to three instances, which correspond to the nodes in a pathway. Figure 4 illustrates the difference between representations in text and a pathway for this sentence.

Figure 4
figure 4

Event representation in NL expression vs. pathway representation. (a) Event annotation to the sentence “Stimulation of cells leads to a rapid phosphorylation of I kappa B alpha, which is presumed to be important for the subsequent degradation,” and (b) the pathway representation corresponding to it. In (a), the black and red arcs link an event with its theme and cause respectively. In (b), the black arcs link two entities representing change of the state.

As exemplified here, for pathway-text association, we need to (1) recognize bio-molecule references from text (recognition of biomolecules), (2) capture biological contexts from the textual context (enumeration of biological contexts), and (3) find corresponding entities in a pathway (identification of instances in biological contexts). While step (1) is the classic Named Entity Recognition (NER) problem, steps (2) and (3) are newly introduced obstacles for the pathway-text association task.

Integration of scattered information into a pathway

On the contrary, in some cases, entities in different texts may have to be mapped to the same node in a pathway. Table 2 shows all of the evidence sentences from the GENIA event corpus associated with the reaction R6. While at a glance, these evidence sentences appear to be different, they all refer to the same event: the translocation of NF-kB from the cytosol to the nucleus. In order to integrate information scattered over different articles into a pathway, these diverse surface expressions should be mapped to a single link (R6) in the pathway. All the typographic variants of NF-kappa B as well as its synonyms (p50-p60, etc., shown in bold), have to be recognized. Furthermore, the Localization event is described by using diverse predicates: translocate, take up, migrate and move, in diverse syntactic constructions (underlined). Simple information extraction techniques based on co-occurrences cannot deliver such fine-tuned recognition of events, and structure-based IE would be indispensable [6, 12, 13].

New type of disambiguation problem

Similarly, correspondence events reported in different papers and reactions in a pathway are not straightforward. Consider Figure 5, which shows a detailed interpretation of the reaction sequence, R3~R6. The main concern of the pathway is the state change of the NF-kB from the inactive to the active form (Figure 5(a)). The whole process is broken down into two sub-processes: the dissociation of NF-kB from IkBa, and the localization of NF-kB into the nucleus (Figure 5(b)). The dissociation process is described in more detail by the IkBa degradation process (Figure 5(c)). These three sequences constitute the whole process of NF-kB activation; however, no single sentence in the pathway corpus reveals this whole process. Each sentence focuses on any of three aspects: (a) the functional state of NF-kB, (b) the molecular state of NF-kB, or (c) the molecular state of IkBa. Let us examine the following two sentences:

Figure 5
figure 5

NF-kB activation process from different perspectives. A fragment (R3~R6) of the NF-kB lifecycle pathway in Figure 2 represents the process of NF-kB activation. (a), (b), and (c) illustrate a breakdown of the process from different perspectives: (a) from the perspective of the functional state of the NF-kB, (b) from the perspective of the molecular state of the NF-kB, and (c) from the perspective of the molecular state of the IkBa. Although (c) does not represent NF-kB directly, the state change of IkBa can be interpreted as a part of NF-kB activation process. It is reflected in the entities of the NF-kB/IkBa complex in the NF-kB pathway in Figure 1, which correspond to I7, I8 and I9 in Table 3.

n-Acetyl-L-cysteine (NAC), a potent antioxidant, blocked NF-kappa B activation caused by IL-4 and by anti-CD40 mAb. (8977531-S6)

The proteolytic degradation of the inhibitory protein MAD 3/I kappa B alpha in response to extracellular stimulation is a prerequisite step in the activation of the transcription factor NF-kappa B. (7565683-S2)

Since the first sentence focuses on external factors (NAC, IL-4 and anti-CE40 mAb) which cause the state change of NF-kB from inactive to active, NF-kB activation in this sentence refers the entire NF-kB activation process in Figure 5(a). In contrast, the second sentence discusses the process at the micro-level, and the NF-kB activation refers only to the localization portion in Figure 5(b), which is the moment at which the activation is accomplished. Thus, all aspects of information discussed in different articles must be integrated into a pathway. The pathway (Figure 2) is a result of such an effort by a biologist who has constructed the model pathway.

In Figure 5, the nodes in the same biological contexts are vertically aligned. This alignment assumes the basic knowledge of the biologist including: “the inactive NF-kB is equated with the NF-kB, which is bound to IkB in the cytoplasm,” and inferences such as, “the presence of the NF-kB which is bound to IkB in the cytoplasm also implies the presence of IkBa itself, which is bound to NF-kB in the cytoplasm” (symmetry). The three leftmost nodes aligned vertically in Figure 5 (the inactive NF-kB and the NF-kB and IkBa, which are bound to each other) are all equivalent and correspond to the entity I7 in Table 3.

Constraints on sequences of multiple reactions

The second difficulty concerns the recognition of mutual relationships among multiple reactions. While the research on Information Extraction (IE) in the bio-medical domain has focused on the extraction of events or relations independently of the contexts in which they appear [1416], a pathway construction needs to extract their mutual relationships in the form of causal chains [1719]. Since exact precedence relationships among events are not usually expressed explicitly, we can only extract constraints on causal chains. At the integration stage, we construct a pathway which satisfies all constraints gathered from the whole set of articles.

Explicit causal expressions in text such as, “A causes B,” “A leads to B,” and “A activates B,” indicate constraints that “A precedes B”. Note that these expressions do not necessarily imply direct causation. It is common that there exist intervening reactions in the pathway between A and B in these expressions. Moreover, human biologists infer much richer constraints on sequences of reactions from expressions in text, other than explicit causal expressions. Consider the following sentence:

A fraction of the phosphorylated form of I kappa B alpha remains physically associated with the NF-kappa B complex in vivo but is subject to rapid degradation, thereby promoting the nuclear translocation of the active NF-kappa B complex.

There are four events mentioned in this sentence: (1) the binding of NF-kappa B complex and I kappa B alpha (R2), (2) the phosphorylation of I kappa B alpha (R3), (3) the degradation of I kappa B alpha (R5), and (4) the translocation of NF-kappa B complex (R6). More importantly, from this sentence, a biologist can extract relational constraints among these four events.

From the verb ‘remains’, s/he understands that the event in the subject (R3) does not distort the event in the objective complement (R2), thereby s/he may infer a constraint on the event sequence; that is, that R2 precedes R3. Another constraint is inferred by ‘is subject to.’ It indicates that a state after reaction (R3) in the subject is easily caught by another reaction (R5) in the object. Therefore, the constraint that R3 precedes R5 is inferred. Lastly ‘thereby’ shows that the event in the foregoing sentence (R5) is followed by the event in the subsequent sentence (R6). By putting them together, she/he successfully constructed the portion of these events in the pathway in Fig 2.


The third difficulty comes from the use of biological domain knowledge. In general, at the stage of information integration in a pathway, biologists extensively use their background knowledge as well as the pathway at hand to infer pieces of information, which are implicit in text. We have to understand what cues in text trigger bio-inferences and what facts are to be inferred for pathway construction.

To elucidate the inference process, we classified the evidence sentences according to whether they contain direct expressions of the identified reactions or not (Table 4). Evidence sentences are classified as direct when they contain annotated events in the GENIA event annotation. The events in the GENIA corpus were independently annotated by a group of annotators. The sentences judged as indirect do not contain such corresponding annotated events but nonetheless were identified as evidence sentences. Table 4 shows that evidence sentences were distributed evenly between direct and indirect ones. In the following, we identify what constitutes indirect cues and discuss what types of bio-inferences are triggered by them. The analysis reveals that highly domain-dependent cues trigger bio-inferences.

Table 4 The dichotomy of the sentence expression for pathway representation: direct or indirect.

Classification of bio-inference for pathway construction from text

We analysed evidence sentences regarded as indirect to see whether specific textual cues exist to trigger the inference, and classified them according to their relationships with the inferred reactions. The analysis resulted in six major inference schemas shown in Table 5. An example sentence of each scheme is shown in Table 6.

Table 5 The breakdown of the indirect expressions by ‘inference’ scheme

1a/b. State(s) of entity(-ies) before or after reaction

From the existence of a certain entity (cue) the biologist inferred the associated reaction. Example cues of this class are precursor (1a, ex1) and heterodimer (1b, ex2) (shown in boldface in Table 6). In general, biologists know that a precursor should be processed to the mature protein and that a heterodimer represents two distinct proteins coming together. These cues trigger inference to assume a processing and binding event, respectively.

Table 6 The example sentences of each ‘bio-inference’ scheme

2a/b. Function (s) of entity(-ies) before or after reaction

The next class is the inference scheme based on the existence of a certain entity(-ies) with a specific function. The example cues of these classes are the high-affinity binding site (ex3) and retaining it in the cytoplasm (ex4), respectively. These indicate that a protein mentioned in the sentence has a specific function. The protein that has a binding function with some protein with high affinity will bind it, ten to one, and the one that has a function of retaining another protein never displays its ability without binding to it. The evidence sentences of these two cues appear for R2.

3a/b. Influence of state or functional change(s) of entity(-ies)

This is the class of inferring a latent reaction from the influence caused by state or functional change. Examples of cue phrases of this type are ‘serine, mutated’ (3a, ex5), and ‘proteasome inhibitor’ (3b, ex6), respectively. Due to the state change of the serine mutation, the protein was not phosphorylated. On the flipside, biologists recognize the reaction that the protein is to be phosphorylated on serine if not mutated (R3). Treatment with a proteasome inhibitor changes the function of the proteasome, thereby inhibiting the degradation event of I kappa B alpha (R5). Biologists also know I kappa B alpha is to be degraded by proteasome if not inhibited. Overall, this class of bio-inference is an association of the reaction normally occurring if something (state or function) is not changed.

4. Related reaction

This class of cues is based on the strong association between two reactions. It infers a reaction unmentioned in text from an explicit reaction. The example of the linguistic cue phrase of this class is ‘As observed with’ (Table 4, ex7). The example does not show that any reaction occurred to I kappa B alpha directly. However, because of the expression: ‘As observed with,’ and the biological knowledge that RelA is a transcription factor, and that I kappa B alpha is a protein produced from its gene through the transcriptional and translational event by a transcription factor, biologists recognize the existence of the reaction that RelA transcribes the mRNA of IkBa gene, and IkBa protein will be induced. Further, the gene expression event occurred to IkBa by RelA (R8). Thus, the linguistic cue of this class appears in the sentence structure.

5. Reverse reaction

This class is the inference from the reverse reaction. The example, ex8 describes the dissociation event that occurred between ‘NF-kappa B’ and ‘I kappa B alpha.’ The dissociation must be preceded by binding event (R2), because all proteins are produced as a single molecule. However, the reverse reaction does not always occur; whether the reverse reaction exists or not depends on the domain knowledge.

6. Characteristic of reaction

The last class is the inference scheme based on a qualifier to show a characteristic of a reaction. The example, ex9 shows the degradation event that occurred to ‘IkappaB-alpha’ directly (R5), but does not indicate what caused the reaction, or how it was reaction. Biologists will pay attention to the cue qualifier ‘proteosomal.’ They know that, as biological domain knowledge, its substantive form ‘proteasome’ is a name of a huge protein complex that causes protein degradation. Thus, they can infer that the modifier should be the proteasome of the degradation event on IkBa from the qualifier ‘proteasomal’ that shows the characteristic of a reaction.


Due to the complexity and size of the pathways being, and to be constructed, TM tools for facilitating their construction and maintenance become crucial for bio-medical research. At the same time, text retrieval based on pathways will become an effective means by which biologists may gain access to articles relevant to their interests. For all such scenarios for TM tools to be materialized, technical challenges involved in associating text with pathways have to be properly understood and formulated.

In this paper, we identified the three challenges. The first one, identification of the mapping position of a specific entity in a pathway, is beyond the traditional challenge tasks of accession number assignment for proteins [11] or protein-protein interactions (PPI) [16]. The resulting comparison between a PPI network and a pathway shows that the same biomolecules appear in several places in a pathway, thus making it necessary for us to associate bio-molecular expressions in text with one of their occurrences in a pathway. The correspondence is highly context dependent.

The second challenge is the recognition of constraints on sequences of multiple reactions. While temporal sequences of events have been studied in the general domain, their techniques are mostly concerned with the treatment of temporal expressions [17]. On the other hand, as exemplified in this paper, we have to recognize causal relationships among events, which require inferences based on the domain knowledge as well as subtle textual cues in text, instead of explicit temporal expressions.

The third and the most difficult of the challenges is the problem involved in formulating and implementing bio-inferences. Half of all crucial evidence sentences do not contain an explicit description of events. The specialist inferred implicit events, and the inferences seem to be triggered by specific fragments of expressions in text. While logical or deductive inferences based on the domain knowledge certainly play a role in the process [20], the inferences here are more akin to associative and plausible inferences [21]. How to formulate and exploit such inferences in TM would be the greatest challenge of all.

The analysis given in this paper has been made on the pathway corpus, which has been built based on the GENIA corpus. Since the GENIA pathway corpus is a highly biased corpus [22], and the size is currently not large enough, we are now working on scaling up the pathway corpus by using full papers, and extending the corpus beyond the GENIA set. Figure 6 shows a fuller version of the NF-kB pathway being constructed by a collection of full papers from the whole Medline. To enlarge the pathway corpus as a rich resource, we will append annotations to the collected relevant sentences depending on whether the text is relevant, which reaction it indicates, whether the expression is ‘direct’ or ‘indirect,’ and its inference scheme in cases where the expression ‘indirect.’ The stored data will be also annotated by the GENIA method of event annotation. In the future we will be engaged in another kind of pathway such as epidermal growth factor receptor signalling pathway thus establishing a consistent method of the pathway annotation for the automatic building of a pathway from text.

Figure 6
figure 6

NF-kB pathway full-text version. The NF-kB pathway full-text version, bigger than GENIA version, has 28 entities and 17 reactions. The additional reactions are as follows: R11: binding between PKAc and Complex(NF-kB/IkBa), R12: phosphorylation of IKK Complex by TAK1, R13:phosphorylation of IKKb Serine cluster, R14: degradation of IkBa by 26S proteasome + dissociation of PKAc from NF-kB + NF-kB phosphorylation by PKAc occur almost at the same time, R15: shattling of IkBa between cytosol and nucleus, R16: dissociation of Complex(NF-kB/kB site) by IkBa, R17: binding between NF-kB and IkBa in the nucleus, R18: translocation of Complex(NF-kB/IkBa) from nucleus to the cytosol.


In this study, we presented two new resources: pathway corpus and its corresponding NF-kB pathway, whose mapping among pathway and text is compared with annotated events. Based on a detailed corpus study, we formulated three challenges for TM tools. We also revealed that inferences triggered by highly domain-dependent cues play a central role in recovering events implicit in text but crucial for text-pathway association. We believe the corpus-based study presented in this paper is an important first step for addressing the new TM technology for pathway construction.


Construction of a PPI network and its mapping to a pathway

The PPI network for the proteins in the TLR pathway is constructed by using an IE system [23]. The IE system uses a full parser [24] to reveal the semantic structures of sentences, and then applies simple rules to identify event types. The event type recognition is based on linguistic clues developed by our previous work [8]. A dictionary-based protein name recognizer is used to map the protein names to a set of accession numbers [10].

The distribution of path lengths for the binding and positive regulation events (Figure 1) are calculated after KO identifies the corresponding nodes in the pathway for each protein in the sentences in which the event is recognized. Moreover, the distribution of path lengths for pairs in all extracted events is calculated by assuming that the two nodes which give the shortest path are the nodes corresponding to the two proteins of a given pair. Therefore, the lengths tend to be estimated to be shorter than in actuality.

Event annotation and the pathway corpus

GENIA event annotation was made on half of the GENEA corpus [8, 25], and consists of 1,000 Medline abstracts. It contains 9,372 sentences in which 36,114 events are identified. The annotation was carried out over the process of two years by a group of annotators (four to six biologists) and one coordinator (co-author of the paper, TO). In order to avoid inter-annotator discrepancy, the instructions were given to prevent them from making free inferences.

On the contrary, the pathway corpus was constructed by a single biologist (co-author of the paper, KO) who has a substantial amount of experience in constructing pathways from literature [9, 26]. KO read all of 5, 223 sentences in the 561 abstracts and constructed the pathway, all the links of which were associated with a set of evidence sentences taken from 5,223 sentences. Unlike event annotation, KO established those links freely and then analyzed, retrospectively, what inferences were made in the process.


  1. Bader GD, Cary MP, Sander C: Pathguide: a pathway resource list. Nucleic Acids Res 2006, 34: D504–506. 10.1093/nar/gkj126

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  2. Luciano JS, Stevens RD: e-Science and biological pathway semantics. BMC Bioinformatics 2007, 8(Suppl 3):S3. 10.1186/1471-2105-8-S3-S3

    Article  PubMed Central  PubMed  Google Scholar 

  3. Rzhetsky A, Iossifov I, Koike T, Krauthammer M, Kra P, Morris M, Yu H, Duboue PA, Weng W, Wilbur WJ, et al.: GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data. J Biomed Inform 2004, 37: 43–53. 10.1016/j.jbi.2003.10.001

    Article  CAS  PubMed  Google Scholar 

  4. Park JC, Kim HS, Kim JJ: Bidirectional incremental parsing for automatic pathway identification with combinatory categorial grammar. Pac Symp Biocomput 2001, 396–407.

    Google Scholar 

  5. Rajagopalan D, Agarwal P: Inferring pathways from gene lists using a literature-derived network of biological relationships. Bioinformatics 2005, 21: 788–793. 10.1093/bioinformatics/bti069

    Article  CAS  PubMed  Google Scholar 

  6. Santos C, Eggle D, States DJ: Wnt pathway curation using automated natural language processing: combining statistical methods with partial and full parse for knowledge extraction. Bioinformatics 2005, 21: 1653–1658. 10.1093/bioinformatics/bti165

    Article  CAS  PubMed  Google Scholar 

  7. Ohta T, Tateisi Y, Mima H, Tsujii J: GENIA corpus: An annotated research abstract corpus in molecular biology domain. In Proceedings of the Human Language Technology Conference (HLT 2002). San Diego, California; 2002:73–77.

    Google Scholar 

  8. Kim J, Ohta T, Tsujii J: Corpus annotation for mining biomedical events from literature. BMC Bioinformatics 2008, 9: 10. 10.1186/1471-2105-9-10

    Article  PubMed Central  PubMed  Google Scholar 

  9. Oda K, Kitano H: A comprehensive map of the toll-like receptor signaling network. Mol Syst Biol 2006, 2: 2006.0015. 10.1038/msb4100057

    Article  PubMed Central  PubMed  Google Scholar 

  10. Rune S, Yoshida K, Yakushiji A, Miyao Y, Matsubayashi Y, Ohta T: AKANE System: Protein-Protein Interaction Pairs in BioCreAtIvE2 Challenge, PPI-IPS subtask. In In the Proceedings of the Second BioCreative Challenge Evaluation Workshop; April. Madrid, Spain; 2007:1–3.

    Google Scholar 

  11. Morgan A, Hirschman L: Overview of BioCreative II Gene Normalization. In Proceedings of the Second BioCreative Challenge Evaluation Workshop. Madrid, Spain; 2007:7–16.

    Google Scholar 

  12. McDonald DM, Chen H, Su H, Marshall BB: Extracting gene pathway relations using a hybrid grammar: the Arizona Relation Parser. Bioinformatics 2004, 20: 3370–3378. 10.1093/bioinformatics/bth409

    Article  CAS  PubMed  Google Scholar 

  13. Yakushiji A, Miyao Y, Tateisi Y, Tsujii J: Biomedical Information Extraction with Predicate-Argument Structure Patterns. In the First International Symposium on Semantic Mining in Biomedicine. Hinxton, Cambridgeshire, UK; 2005:60–69.

    Google Scholar 

  14. Temkin JM, Gilder MR: Extraction of protein interaction information from unstructured text using a context-free grammar. Bioinformatics 2003, 19: 2046–2053. 10.1093/bioinformatics/btg279

    Article  CAS  PubMed  Google Scholar 

  15. Daraselia N, Yuryev A, Egorov S, Novichkova S, Nikitin A, Mazo I: Extracting human protein interactions from MEDLINE using a full-sentence parser. Bioinformatics 2004, 20: 604–611. 10.1093/bioinformatics/btg452

    Article  CAS  PubMed  Google Scholar 

  16. Krallinger M, Leitner F, Valencia A: Assessment of the Second BioCreative PPI task: Automatic Extraction of Protein-Protein Interactions. In Proceedings of the Second BioCreative Challenge Evaluation Workshop. Madrid, Spain; 2007:41–54.

    Google Scholar 

  17. Wilson G, Mani I, Sundheim B, Ferro L: A multilingual approach to annotating and extracting temporal information. Proceeding of the workshop on Temporal and spatial information processing 2001., 7:

    Google Scholar 

  18. Kontos J, Elmaoglou A, Malagardi I: ARISTA Causal Knowledge Discovery from Texts. In Proceedings of 5th International Conference, DS 2002. Springer Berlin /Heidelberg; 2002:348–355. Nov 24–26; Lubeck, Germany

    Google Scholar 

  19. Kim J, Ohta T, Oda K, Tsujii J: From Text to Pathway: Corpus Annotation for Knowledge Acquisition from Biomedical Literature. Proceedings of the 6th Asia Pacific Bioinformatics Conference (APBC) 2008. to appear

    Google Scholar 

  20. Schulz S, Kumar A, Bittner T: Biomedical ontologies: what part-of is and isn't. J Biomed Inform 2006, 39: 350–361. 10.1016/j.jbi.2005.11.003

    Article  PubMed  Google Scholar 

  21. Tsujii J, Ananiadou S: Thesaurus or logical ontology, which do we need for mining text? Language Resources and Evaluation 2005, 39: 77–90. 10.1007/s10579-005-2697-0

    Article  Google Scholar 

  22. Krallinger M, Malik R, Valencia A: Text mining and protein annotations: the construction and use of protein description sentences. Genome Inform 2006, 17: 121–130.

    CAS  PubMed  Google Scholar 

  23. Miyao Y, Ohta T, Masuda K, Tsuruoka Y, Yoshida K, Ninomiya T, Tsujii J: Semantic Retrieval for the Accurate Identification of Relational Concepts in Massive Textbases. In Proceedings of COLING-ACL 2006. July; Sydney, Australia; 2006:1017–1024.

    Google Scholar 

  24. Miyao Y, Tsujii J: Feature Forest Models for Probabilistic HPSG Parsing. Computational Linguistics 2008.

    Google Scholar 

  25. Ohta T, Tateisi Y, Kim J, Yakushiji A, Tsujii J: Linguistic and Biological Annotations of Biological Interaction Events. In Proceedings of The Fifth International Conference on Language Resource and Evaluation (LREC 2006). Edited by: Calzolari N. May; Genoa, Italy; 2006:1405–1408.

    Google Scholar 

  26. Oda K, Matsuoka Y, Funahashi A, Kitano H: A comprehensive pathway map of epidermal growth factor receptor signaling. Mol Syst Biol 2005, 1: 2005.0010. 10.1038/msb4100014

    Article  PubMed Central  PubMed  Google Scholar 

Download references


This work was partially supported by Grant-in-Aid for Specially Promoted Research (Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan), Genome Network Project (MEXT, Japan), and Special Coordination Funds for Promoting Science and Technology (MEXT, Japan).

This article has been published as part of BMC Bioinformatics Volume 9 Supplement 3, 2008: Proceedings of the Second International Symposium on Languages in Biology and Medicine (LBM) 2007. The full contents of the supplement are available online at

Author information

Authors and Affiliations


Corresponding author

Correspondence to Jun'ichi Tsujii.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

KO carried out the construction of the pathway corpus, NF-kappa B pathway, manually annotated to link to the pathway, analyzed the inference scheme, and drafted the manuscript. JDK participated in the design of the study, carried out an investigation about the mapping position in a pathway, and drafted the manuscript. TO chiefly carried out the GENA event annotation and drafted the manuscript. DO and TM constructed a PPI network and mapped it to a Pathway, and drafted the manuscript. YT participated in the design of the study and drafted the manuscript. JT conceived of the study, participated in its design and coordination, and helped to draft the manuscript. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Oda, K., Kim, JD., Ohta, T. et al. New challenges for text mining: mapping between text and manually curated pathways. BMC Bioinformatics 9 (Suppl 3), S5 (2008).

Download citation

  • Published:

  • DOI: