Introduction
With an overwhelming amount of biomedical knowledge recorded in text, it is not surprising that there is so much interest in techniques that can identify, extract, manage, integrate and exploit this knowledge, and moreover discover new, hidden or unsuspected knowledge. For this reason, the past five years have seen an upsurge of research papers and overviews [1–5] on text mining from the biomedical literature. To facilitate knowledge discovery in biomedicine, we need approaches that harvest and integrate information from text, biological databases, ontologies and terminological resources. To discover knowledge hidden in the large volume of biomedical text, we need text mining techniques that go beyond simple lexical and syntactic processing to deeper levels of linguistic analysis. Text mining beyond the surface level requires semantic information. Semantic mining therefore assumes certain prerequisites, such as rich levels of linguistic and semantic annotation, supported by ontologies and other knowledge sources that provide the semantics of the annotation. Semantic text mining enables us to capture the relevant content of documents according to user needs.
Semantic mining relies crucially on the following steps:
- named entity recognition
- discovery of semantic relations between entities
- event discovery
The limitations of existing terminological and ontological resources such as the Gene Ontology, Swiss-Prot, Entrez Gene, UMLS and MeSH have been well documented [6, 7]. Their entries are not useful for specific text searches, and they do not contain the types of variability encountered in text. Results from the BioCreAtIvE [8] evaluation challenge reflect the problems associated with named entity recognition in biomedicine. Term ambiguity (e.g. homologues, overlap with general-language words) and term variation (e.g. spelling and morphological variants) account for the low performance of named entity recognisers for biology in comparison to those for newswire.
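To make the term variation problem concrete, the following minimal sketch contrasts exact lookup in a terminological resource with lookup after a simple surface normalisation. It is an illustration only, not a method taken from the cited work: the two-entry lexicon, its identifiers and the function names are all hypothetical, and the normalisation handles only case, hyphenation and spacing.

```python
import re

# Hypothetical toy lexicon: canonical name -> made-up identifier.
# Real resources hold far more entries and real accession numbers.
LEXICON = {
    "NF-kappaB": "hypothetical-id-1",
    "interleukin-2": "hypothetical-id-2",
}


def normalize(term: str) -> str:
    """Collapse case, hyphen, underscore, slash and whitespace variation."""
    lowered = term.lower()
    return re.sub(r"[\s\-_/]+", "", lowered)


# Index the lexicon by normalised form so that variants map to one key.
INDEX = {normalize(name): ident for name, ident in LEXICON.items()}


def lookup_exact(mention: str):
    """Exact string match against the canonical names."""
    return LEXICON.get(mention)


def lookup_normalized(mention: str):
    """Match after normalising the mention in the same way as the index."""
    return INDEX.get(normalize(mention))


if __name__ == "__main__":
    for mention in ["NF-kappa B", "Interleukin 2", "p53"]:
        print(mention, lookup_exact(mention), lookup_normalized(mention))
```

Normalisation of this kind recovers only spelling-level variants. It does nothing for term ambiguity, and collapsing surface forms too aggressively can itself merge distinct terms, which is one reason the problem remains difficult.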
The existence of semantically annotated corpora for testing and training is of paramount importance for efficient semantic mining based on NLP techniques. The bio-text mining community must develop resources in the form of both semantically enriched lexical resources (from ontologies) and richly annotated corpora such as GENIA [9], PennBioIE [10] and GENETAG [11].
Evaluation of information extraction systems (e.g. for protein-protein interactions) has until now been based on rather small benchmark sets, mainly generated by research groups using their own systems. A promising path is offered by the BioCreative 06 [12] evaluation challenge, which will provide common benchmark sets for training and testing different systems that extract protein-protein interactions from full-text articles. This assessment will show how close we are to providing solutions for real-world problems in molecular biology and biomedicine.
But this is just one part of the story so far: for text mining to exploit semantic data mining, we need to integrate the results of text mining not only with knowledge resources but, most crucially, with experimental data, which will lead to biological discoveries. The analysis of high-throughput data in combination with information extracted from text about the relationships between the investigated entities will allow biologists to make predictions about novel interactions.