Automatic extraction of biomolecular interactions: an empirical approach

Background We describe a method for extracting data about how biomolecule pairs interact from texts. This method relies on empirically determined characteristics of sentences. The characteristics are efficient to compute, making this approach to extraction of biomolecular interactions scalable. The results of such interaction mining can support interaction network annotation, question answering, database construction, and other applications.

Results We constructed a software system to search MEDLINE for sentences likely to describe interactions between given biomolecules. The system extracts a list of the interaction-indicating terms appearing in those sentences, then ranks those terms based on their likelihood of correctly characterizing how the biomolecules interact. The ranking process uses a tf-idf (term frequency–inverse document frequency) based technique using empirically derived knowledge about sentences, and was applied to the MEDLINE literature collection. Software was developed as part of the MetNet toolkit (http://www.metnetdb.org).

Conclusions Specific, efficiently computable characteristics of sentences about biomolecular interactions were analyzed to better understand how to use these characteristics to extract how biomolecules interact. The text empirics method that was investigated, though arising from a classical tradition, has yet to be fully explored for the task of extracting biomolecular interactions from the literature. The conclusions we reach about the sentence characteristics investigated in this work, as well as the technique itself, could be used by other systems to provide evidence about putative interactions, thus supporting efforts to maximize the ability of hybrid systems to support such tasks as annotating and constructing interaction networks.


APPENDIX A: PREVALENCE OF USEFUL TRIPLES
We performed a preliminary study to help determine the scope of the problem of extracting, from sentences, triples that describe two interacting biomolecules and an interaction indicating term (IIT) correctly characterizing the interaction. For this we used the IEPA corpus (Appendix B; Ding et al. 2002). Our analysis focused on an enriched subset of triples found in these sentences, namely those for which the IIT was in the same phrase as at least one of the biomolecule names. We called these admissible sentence triples. The rationale was that admissible sentence triples would be relatively likely to describe an interaction compared to other triples.

Analysis 1.
The first analysis showed that 55% (331 out of 606) of co-occurring biomolecule names associated with one or more admissible sentence triples described an interaction. We concluded the following.
(1) Many co-occurrences did not describe an interaction.
(2) Fewer than 55% of the new admissible sentence triples described an interaction, because some biomolecule co-occurrences appeared in several admissible sentence triples owing to the presence of multiple IITs in the sentence. Usually, only one of those IITs described an interaction of that biomolecule co-occurrence.
The issue of one co-occurrence being in multiple admissible sentence triples was further investigated in Analysis 2, next.

Analysis 2.
Determining what IIT is semantically associated with a particular biomolecule co-occurrence would be easier to get right if every sentence of interest had exactly one co-occurrence and one IIT. However, since many sentences have one co-occurrence and multiple IITs, it is more difficult to determine which is the applicable IIT. Another category of challenging sentences consists of those with multiple co-occurrences.
For example, if a sentence contains the three biomolecule names A, B, and C, then it has the three co-occurrences AB, AC, and BC. Matching IITs that may be present with their associated co-occurrences then becomes an issue. To better understand the scope of this problem, we analyzed the same corpus and created Table 6.
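The combinatorics in this example can be sketched in a few lines of Python. This is only an illustration, not the software used in the study: the function names `cooccurrences` and `candidate_triples` are hypothetical, and the sketch assumes each biomolecule is named once in the sentence.

```python
from itertools import combinations

def cooccurrences(names):
    """All unordered pairs of biomolecule names found in one sentence.

    Assumes the names are distinct (each biomolecule mentioned once).
    """
    return list(combinations(names, 2))

def candidate_triples(names, iits):
    """Pair every co-occurrence with every IIT found in the sentence.

    Each result is a candidate tri-occurrence; deciding which IIT truly
    belongs to which co-occurrence is the matching problem described above.
    """
    return [(a, b, iit) for a, b in cooccurrences(names) for iit in iits]

pairs = cooccurrences(["A", "B", "C"])
triples = candidate_triples(["A", "B", "C"], ["binds", "inhibits"])
```

Three names give three co-occurrences, and each additional IIT in the sentence adds three more candidate triples, which is why the matching problem grows quickly.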
The key row in Table 6 is bolded. Note that the restriction to admissible sentence tri-occurrences improves the situation compared to looking at sentence tri-occurrences in general, because the set of admissible sentence tri-occurrences excludes some sentence tri-occurrences that are relatively unlikely to indicate an interaction.

Table 6. Comparison of co-occurrences and tri-occurrences, and sentences and phrases, with respect to their richness as sources for mining biomolecular interactions. Key: Phrase co-occurrence: two biomolecules present in the same phrase. Phrase tri-occurrence: an IIT and phrase co-occurrence all within the same phrase. Sentence co-occurrence: two biomolecules present in the same sentence (a phrase co-occurrence is also a sentence co-occurrence). Admissible sentence tri-occurrence is defined in the text. Phrase was operationally defined as follows.
1) The beginning or end of a sentence is also the beginning or end of a phrase.
2) {, ; :} each indicate the end of one phrase and the beginning of the next.
3) <whitespace> -<whitespace> (a dash or hyphen with whitespace on each side) indicates the end of one phrase and the beginning of the next.
4) Left and right parentheses indicate the start and end of a phrase.
5) Only phrases containing a co-occurrence of specific typical biomolecules (analyzed in detail in Ding et al. (2002)), or their synonyms, were considered.
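Rules 1 to 4 of this operational definition, together with the admissibility criterion from the text (the IIT shares a phrase with at least one of the two biomolecule names), might be approximated as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `split_phrases` and `is_admissible` are hypothetical names, parentheses are treated as simple non-nested delimiters, and names and IITs are matched by naive case-insensitive substring search. Rule 5 is a corpus-level filter and is left to the caller.

```python
import re

# Rules 2-4: ',', ';', ':' and parentheses delimit phrases, as does a
# hyphen or dash with whitespace on each side. Rule 1 holds because the
# input is assumed to be a single sentence.
_PHRASE_BOUNDARY = re.compile(r"[,;:()]|\s+[-\u2013\u2014]\s+")

def split_phrases(sentence):
    """Split one sentence into phrases, dropping empty fragments."""
    parts = _PHRASE_BOUNDARY.split(sentence)
    return [p.strip() for p in parts if p.strip()]

def is_admissible(sentence, name_a, name_b, iit):
    """True if both names occur in the sentence and the IIT shares a
    phrase with at least one of them (naive substring matching)."""
    lowered = sentence.lower()
    a, b, term = name_a.lower(), name_b.lower(), iit.lower()
    if a not in lowered or b not in lowered:
        return False
    for phrase in split_phrases(lowered):
        if term in phrase and (a in phrase or b in phrase):
            return True
    return False

example = "Leptin inhibits insulin secretion, but NPY (a neuropeptide) is unaffected - as expected."
phrases = split_phrases(example)
```

A hyphen inside a word such as "IL-2" is not treated as a boundary, because rule 3 requires whitespace on both sides.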
Only 22% of admissible sentence tri-occurrences described an interaction, so 78% did not. That seemingly challenging fact suggests the possibility of simply ignoring "complicated" sentences and restricting analysis to those with one co-occurrence and one IIT (that is, one admissible sentence tri-occurrence). Yet doing so might exclude considerable potentially useful data. This issue was investigated in Analysis 3, next.

Analysis 3.
We investigated the prevalence of multiple IITs in sentences using the same corpus as for Analysis 1 and Analysis 2. Figure 7 shows the results. Even limiting attention to sentences containing just one co-occurrence of a pair of biomolecules (that is, each biomolecule is named once in the sentence), most had more than one IIT (n IITs imply n tri-occurrences for a sentence with a single co-occurrence). About 10% of the sentences had no IIT and about 20% had just one, so about 70% had two or more: two, three, four, five, and in a few cases even more.
One may conclude from Figure 7 that restricting attention to sentences with just one IIT would mean ignoring most of the data. But considering sentences with multiple IITs means grappling with the challenge of deciding which IITs properly describe an interaction and which do not, thus motivating the IIT ranking technique presented in this article.
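The kind of tally behind Analysis 3 can be illustrated with a short Python sketch. It is not the code that produced Figure 7: the function names, the three-word IIT lexicon, and the toy sentences are all made up for illustration, and IITs are recognized by simple lowercase token lookup.

```python
import re
from collections import Counter

def count_iits(sentence, iit_lexicon):
    """Number of tokens in the sentence that appear in the IIT lexicon."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return sum(1 for t in tokens if t in iit_lexicon)

def iit_count_distribution(sentences, iit_lexicon):
    """Fraction of sentences having 0, 1, 2, ... IITs."""
    counts = Counter(count_iits(s, iit_lexicon) for s in sentences)
    total = len(sentences)
    return {n: counts[n] / total for n in sorted(counts)}

# Hypothetical lexicon and toy sentences, for illustration only.
lexicon = {"binds", "inhibits", "activates"}
toy_sentences = [
    "A binds B",
    "A binds and activates B",
    "A and B were measured",
    "A inhibits B",
]
distribution = iit_count_distribution(toy_sentences, lexicon)
```

For a sentence with a single co-occurrence, the IIT count equals the number of tri-occurrences, so the same tally shows how many candidate triples a ranking technique has to choose among.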