A bioinformatics knowledge discovery in text application for grid computing
- Marcello Castellano†1,
- Giuseppe Mastronardi†1,
- Roberto Bellotti†2 and
- Gianfranco Tarricone†1
© Castellano et al; licensee BioMed Central Ltd. 2009
Published: 16 June 2009
A fundamental activity in biomedical research is Knowledge Discovery, which makes it possible to search through large amounts of biomedical information such as documents and data. High performance computational infrastructures, such as Grid technologies, are emerging as a possible way to tackle the intensive use of Information and Communication Technology (ICT) resources in life science. The goal of this work was to develop a software middleware solution that allows the many knowledge discovery applications to exploit scalable and distributed computing systems and so make intensive use of ICT resources.
The development of a grid application for Knowledge Discovery in Text using a methodology based on a middleware solution is presented. The system must be able to model a user application and process the jobs with the aim of creating many parallel jobs to distribute on the computational nodes. Finally, the system must be aware of the computational resources available and their status, and must be able to monitor the execution of parallel jobs. These operative requirements led to the design of a middleware that is specialized using user application modules. It includes a graphical user interface that gives access to a node search system, a load balancing system and a transfer optimizer to reduce communication costs.
A prototype of the middleware solution and its performance evaluation in terms of the speed-up factor are shown. The prototype was written in Java on Globus Toolkit 4, which was used to build the grid infrastructure on GNU/Linux grid nodes. A test was carried out and the results are shown for the named entity recognition search of symptoms and pathologies. The search was applied to a collection of 5,000 scientific documents taken from PubMed.
In this paper we discuss the development of a grid application based on a middleware solution. It has been tested on a knowledge discovery in text process to extract new and useful information about symptoms and pathologies from a large collection of unstructured scientific documents. As an example, a Knowledge Discovery in Database computation was applied to the output produced by the KDT user module to extract new knowledge about symptom and pathology bio-entities.
Progress in the biomedical field largely relies on results which are obtained in laboratories and institutions around the world and published in many journals. With the number of publications increasing daily, searching for highly specific data is becoming more difficult. As one of the most frequent activities in biomedical studies, bio-entity recognition is receiving greater attention. Bio-entity recognition aims to identify and classify technical terms corresponding to instances of concepts that are of interest to molecular biologists. Examples of such entities include the names of proteins and genes, their locations of activity such as the names of cells or organisms, drugs, symptoms, pathologies and so on. Entity recognition is becoming increasingly important with the massive increase in reported results due to high throughput experimental methods. It can be used in several higher level information access tasks such as relation extraction, summarization and question answering. Recognising biological entities in texts allows further extraction of relationships and key concepts of interest, and allows those concepts to be represented in a consistent, normalised form. This task is challenging for several reasons. A complete dictionary of biological entities does not exist; hence, simple text matching algorithms do not produce reliable results. In addition, the same word or phrase can refer to different things depending on the context, and some biological entities have several names. Moreover, biological entities can have multi-word names, which complicates the task with the need to determine name boundaries and resolve the overlap of candidate names. Because of the potential utility of this recognition and the complexity of the problem, named entity recognition has attracted the interest of many researchers and generated much research.
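The boundary and overlap issues described above can be made concrete with a minimal sketch of dictionary lookup using greedy longest match. This is an illustration only, not the recognition method used in this work: the function name `find_entities`, the toy dictionary and the labels are assumptions introduced for the example, and dictionary entries are assumed to be space-separated words.

```python
import re


def find_entities(text, gazetteer):
    """Return (start, end, surface, label) tuples using greedy longest match.

    `gazetteer` maps a lowercase entity name (possibly multi-word, words
    separated by single spaces) to a label. Hypothetical helper for illustration.
    """
    # Tokenize into words while remembering the character offsets of each word.
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"\w+", text)]
    max_len = max((len(name.split()) for name in gazetteer), default=0)
    found = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first so that "chest pain" beats "pain".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tok for tok, _, _ in tokens[i:i + n])
            if candidate in gazetteer:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                found.append((start, end, text[start:end], gazetteer[candidate]))
                i += n  # jump past the match so candidates never overlap
                matched = True
                break
        if not matched:
            i += 1
    return found
```

With the toy dictionary `{"chest pain": "Symptom", "pain": "Symptom", "migraine": "Pathology"}`, the phrase "chest pain" is reported once as a two-word entity rather than as an overlapping "pain". The sketch also shows the limits noted above: it cannot disambiguate by context, and it finds nothing that is missing from the dictionary.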
With the large amount of genomic information being generated by biomedical researchers, it should not be surprising that, in the genomics era, much of the work in biomedical named-entity recognition has focused on identifying gene and protein names in free text [1, 2].
Although the search problem has been simplified by search engines, the number of results returned is usually very large, while the relevance of the results may be small. A search based on keywords is unable to answer specific questions about the location and usage of the keywords in the retrieved documents. For all these reasons, the problem of discovering useful knowledge from unstructured text is attracting increasing attention. The solution to this problem is called Knowledge Discovery in Text, and it refers to the process of extracting interesting and non-trivial patterns or knowledge from unstructured text documents. The application of Knowledge Discovery in Text in the biomedical field can improve efficiency for researchers by shifting the burden of information overload from them to the computer through automatic Text Mining (TM) procedures. TM examines the relationships between specific kinds of information contained in a single document or across a whole volume of documents. For example, TM can aid database curators by selecting the articles most likely to contain information of interest. It could also lead to the discovery of potential treatments for migraines by looking for pharmacological substances that are associated with the biological processes involved in migraines. Knowledge discovery in text and applications of this process are available in the literature [3–6].
The problems of applications based on the mining methods described so far often occur in data-intensive situations. These situations require that the same logic be applied to a large collection of different data items that are independent of each other. Hence, the limits will be technological if these problems are addressed by traditional machines that sequentially perform the same set of instructions on an entire collection of homogeneous and independent data. The time required for execution will increase with the size of the collection and will therefore become the limiting factor in these applications. For some time, the computing literature has offered possible solutions in the form of parallel computing models such as SIMD. Such supercomputers, however, are expensive centralized computing systems. A more economical solution, which scales dynamically with the size of the data collection to be analyzed, is offered by loosely coupled networked computing systems. Recently, a computational paradigm has been explored which suggests creating pools of computing resources. These pools have a high use efficiency and can achieve performance levels comparable to monolithic computing systems, i.e. supercomputers. The use of this technology is called Grid Computing. This type of computing relies on a basic middleware infrastructure on which a middleware solution is constructed, that is, a set of services which orient the infrastructure to a specific class of use. Much effort is being made in Europe and internationally to develop this computing tool for users in the fields of physics, biology and research in general [7–9].
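The data-parallel pattern described above, the same logic applied to independent parts of a collection, can be sketched as follows. This is a minimal local illustration, not the grid middleware itself: a thread pool stands in for the grid nodes, and the names `split_into_chunks`, `count_keyword` and `run_data_parallel` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split `items` into `n_chunks` contiguous parts whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for k in range(n_chunks):
        size = base + (1 if k < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def count_keyword(documents, keyword="migraine"):
    """Toy per-chunk job: count occurrences of `keyword` in a list of texts."""
    return sum(doc.lower().count(keyword) for doc in documents)


def run_data_parallel(documents, n_workers, job=count_keyword):
    """Apply the same `job` to every chunk independently, then merge the partial results."""
    chunks = split_into_chunks(documents, n_workers)
    # Each worker plays the role of one computational node (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(job, chunks))
    return sum(partial_results)
```

Because the chunks are independent, the merged result equals the sequential one, and only the wall-clock time depends on how many workers are available. On a real grid, each chunk would additionally incur file transfer and scheduling costs that this sketch ignores.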
Bio-medical informatics is one of the areas in which Grid technology advances could bring significant benefit for the research studies of scientists as well as the everyday work of clinicians. Recently, there has been much excitement in the distributed and parallel systems community, as well as in the distributed database applications community, about the emergence of Grids as the platform for scientific and medical collaborative computing. Grid computing promises to resolve many of the difficulties in facilitating medical informatics and medical image analysis by allowing radiologists and clinicians to collaborate without having to co-locate. Grid technology can potentially provide medical applications with an architecture for easy and transparent access to distributed heterogeneous resources, such as data storage, networks and computational resources, across different organizations and administrative domains. The Grid offers a configurable environment whereby structures can be reorganized dynamically without affecting any overall active Grid processing. In particular, the Grid can address the following issues relevant to bio-medical domains: data distribution, that is, the Grid provides connectivity for medical data distributed over different sites; heterogeneity, that is, the Grid addresses the issue of heterogeneity by developing common interfaces for access and integration of diverse data sources; data processing and analysis, that is, the Grid offers a platform for transparent resource management in medical analyses; security and confidentiality, that is, enabling secure data exchange between hospitals distributed across networks, which is one of the major concerns of medical applications [10–13].
Even though projects at international, European and national levels attempt to achieve these goals on a large scale, work that reconstructs the scenario on a small scale allows laboratory analyses through the testing of small problems, such as the experimentation of new analytic procedures at the application level.
In this work, we present a feasibility study for building a middleware for SIMD applications. Its performance is demonstrated with a case study of named bio-entity recognition. The application is based on knowledge discovery in text to annotate new knowledge from unstructured textual documents. Moreover, the middleware offers the ability to run the application in a distributed environment using grid computing. In particular, the GATE software platform was used to perform automatic analysis of scientific documents. GATE is a toolkit used through the GATE Java API, and its documentation is available in [14, 15]. Globus is a toolkit which enables the construction of middleware grid services oriented towards data-intensive applications. A large amount of documentation is available from the Globus Alliance. Finally, it should be noted that new knowledge discovery procedures could be applied to the results of textual analyses to generate new knowledge. An example of this is shown with an application known as Knowledge Discovery in Database (KDD). Developing a middleware solution that progressively supplies the user with more instruments for knowledge discovery analysis could lead to the definition of new knowledge discovery procedures. These developments would be of great use for studies in fields such as bio-medicine.
The bioinformatics application discussed in this paper concerns the extraction of biological entities related to symptoms and pathologies from a large collection of biomedical papers. In addition, the application searches for new knowledge about them using knowledge discovery in text for grid computing. In this section, we briefly describe the KDT methodology and then explain how to reduce a data-intensive application to a scalable SIMD job, from the point of view of data and CPU resources, in a grid environment.
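One way to read the combined data and CPU point of view is as a question of how many documents each node should receive. The sketch below allocates documents in proportion to a per-node weight using the largest-remainder method. It is an assumption-laden illustration, not the load balancing system of the middleware: the function name `allocate_documents` and the weights are hypothetical, and in practice a weight might be derived from the node status reported by the grid information services.

```python
def allocate_documents(n_documents, node_weights):
    """Allocate document counts to nodes in proportion to their weights.

    `node_weights` maps a node name to a non-negative relative capacity
    (hypothetical values). The largest-remainder method guarantees that
    the returned counts sum exactly to `n_documents`.
    """
    total = sum(node_weights.values())
    if total <= 0:
        raise ValueError("at least one node must have a positive weight")
    # Ideal (fractional) share of each node.
    exact = {node: n_documents * w / total for node, w in node_weights.items()}
    # Start from the integer part of each share.
    counts = {node: int(share) for node, share in exact.items()}
    leftover = n_documents - sum(counts.values())
    # Hand the remaining documents to the nodes with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda node: exact[node] - counts[node],
                          reverse=True)
    for node in by_remainder[:leftover]:
        counts[node] += 1
    return counts
```

For instance, `allocate_documents(5000, {"Server": 1, "Alfa": 2, "Beta": 2})` assigns 1,000 documents to the Server node and 2,000 to each of the other two; the weights here are invented for the example.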
The experimental set-up used to execute the test-run of the prototype was as follows:
b. A data collection consisting of 5,000 medical publications in text file format was created from the PubMed Central repository.
c. A list of keywords about Symptoms and another list of keywords about Pathologies were created from http://www.wrongdiagnosis.com/lists/symptoms.htm and http://www.wrongdiagnosis.com/lists/condsaz.htm respectively. They were specified as LST files, as required by the template file of the ANNIE plug-in for GATE.
e. A computational grid of three computational nodes (Server, Alfa and Beta), GNU/Linux machines operating on a 100 Mbps Ethernet LAN, was created using Globus Toolkit 4.0.5 with the access interface for Condor on the pool. Prior to this, the Condor scheduling system had been installed on each machine. The following grid services were configured: GridFTP for file transfer; GRAM for resource management and job submission; MDS, the monitoring and discovery system (the information services component of Globus); and RFT for the secure file transfer, operating solely on the Server node, which is the central grid node for reliable transfer management.
f. GATE was installed and configured on all grid nodes.
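The keyword lists in step c can be prepared with a few lines of code. The sketch below writes one plain-text list file per entity type, one term per line, together with an index file whose lines have the form `file.lst:majorType`, which is our understanding of the layout expected by the default ANNIE gazetteer. The function name `write_gazetteer`, the file names and the type names are assumptions made for illustration and are not taken from the set-up above.

```python
from pathlib import Path


def write_gazetteer(directory, lists):
    """Write one .lst file per entity type plus a lists.def index.

    `lists` maps a major type (e.g. "Symptom") to an iterable of terms.
    File names are derived from the type names (a choice of this sketch).
    Returns the path of the index file.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    index_lines = []
    for major_type, terms in lists.items():
        # Drop blank entries and exact duplicates; keep a stable, sorted order.
        unique_terms = sorted({t.strip() for t in terms if t.strip()})
        file_name = f"{major_type.lower()}.lst"
        (directory / file_name).write_text(
            "\n".join(unique_terms) + "\n", encoding="utf-8")
        # Index entry: list file name, then the major type assigned to its terms.
        index_lines.append(f"{file_name}:{major_type}")
    index_path = directory / "lists.def"
    index_path.write_text("\n".join(index_lines) + "\n", encoding="utf-8")
    return index_path
```

Keeping the lists as generated files means the same directory can be copied unchanged to every grid node in step f, so that all nodes annotate with identical dictionaries.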
This study began from the standpoint that, in biological research, new findings can emerge from the analysis of correlated, unstructured information present in publications and scientific documents. The application executed in this study adapted the Knowledge Discovery in Text process to the task of extracting biomedical knowledge in terms of symptoms and pathologies. This facility could be a valuable support for physicians and medical researchers who need to make important decisions. A strong point of the proposed system is that it can be used for applications in which the data can be partitioned into different and independent data-sets. Another fundamental characteristic of the proposed system is the grid-based approach, which can supply high performance computing infrastructures to address computational problems in this field. Finally, we believe it is useful to emphasize that the knowledge discovery process in text should be considered one phase in a larger knowledge discovery program. Here, we have briefly reported part of the findings obtained by applying a further important process, Knowledge Discovery in Database (KDD), to the knowledge output from KDT. The field of KDD includes a new generation of techniques and tools for the automatic and intelligent analysis of large volumes of data, "data mines", in order to extract hidden knowledge.
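As an illustration of how a KDD step might consume the KDT output, the sketch below counts how often a symptom and a pathology are annotated in the same document and derives a simple support and confidence for each pair. It is not the KDD procedure applied in this study, whose details are not given here: the function name `cooccurrence_rules`, the input layout and the threshold are assumptions.

```python
from collections import Counter
from itertools import product


def cooccurrence_rules(annotated_docs, min_support=2):
    """Derive symptom -> pathology rules from per-document annotations.

    `annotated_docs` is an iterable of (symptoms, pathologies) pairs, one per
    document, each a collection of entity names (hypothetical input layout).
    Returns (symptom, pathology, support, confidence) tuples, where support is
    the number of documents containing both entities and confidence is that
    number divided by the number of documents containing the symptom.
    """
    pair_counts = Counter()
    symptom_counts = Counter()
    for symptoms, pathologies in annotated_docs:
        symptoms, pathologies = set(symptoms), set(pathologies)
        symptom_counts.update(symptoms)
        pair_counts.update(product(symptoms, pathologies))
    rules = []
    for (symptom, pathology), support in pair_counts.items():
        if support >= min_support:
            confidence = support / symptom_counts[symptom]
            rules.append((symptom, pathology, support, confidence))
    # Strongest rules first; names break ties so the order is deterministic.
    rules.sort(key=lambda r: (-r[2], -r[3], r[0], r[1]))
    return rules
```

Co-occurrence in a document is only weak evidence of a relationship, so the pairs returned by such a step are candidates for expert review rather than established associations.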
In this paper we have presented the development of a middleware solution for a Bioinformatics Knowledge Discovery in Text process. It was designed for medical text documentation using a testbed computational Grid based on Globus middleware. We have discussed a Knowledge Discovery in Text process performed on medical papers with the purpose of identifying all the specific names for biological entities with particular attention placed on the name recognition of symptoms and pathologies. Particular attention has been given to the grid-based environment, its software architecture and how it may be possible to design a modular application to use GATE functionalities in a grid-based solution.
The authors acknowledge the financial support provided by the Italian Ministry of Education, University and Research and by e.B.I.S. s.r.l. (electronic Business in Security), a spin-off of the Polytechnic of Bari, which made this work possible as a result of our research activities.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 6, 2009: European Molecular Biology Network (EMBnet) Conference 2008: 20th Anniversary Celebration. Leading applications and technologies in bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S6.
- Leser U, Hakenberg J: What makes a gene name? Named entity recognition in the biomedical literature. Brief Bioinform 2005, 6(4):357–369.
- Shatkay H, Feldman R: Mining the biomedical literature in the genomic era: an overview. J Comput Biol 2003, 10(6):821–855.
- Hotho A, Nürnberger A, Paaß G: A brief Survey of Text Mining. LDV Forum – GLDV Journal for Computational Linguistics and Language Technology 2005, 20(Suppl 1):19–62.
- Cohen AM, Hersh WR: A survey of current work in biomedical text mining. Brief Bioinform 2005, 6(1):57–71.
- Zweigenbaum P, Demner-Fushman D, Yu H, Cohen KB: Frontiers of biomedical text mining: current progress. Brief Bioinform 2007, 8(5):358–375.
- Biomedical Literature (and text) Mining Publications [http://blimp.cs.queensu.ca/]
- Foster I, Kesselman C: The Grid: Blueprint for a New Computing Infrastructure. Morgan-Kaufmann; 1998.
- The DataGrid Project [http://eu-datagrid.web.cern.ch/]
- EGEE Enabling Grids for E-sciencE [http://www.eu-egee.org/]
- EMBRACE Network of Excellence – A European Model for Bioinformatics Research and Community Education [http://www.embracegrid.org/page.php?page=home]
- MAGIC-5 INFN Medical Application on a Grid Infrastructure Connection [http://www.magic5.unile.it/]
- The BioinfoGRID Project – Bioinformatics Grid Application for life science [http://www.bioinfogrid.eu/]
- Talbi EG, Zomaya AY: Grid Computing for Bioinformatics and Computational Biology. Wiley Interscience; 2007.
- Cunningham H, Maynard D, Bontcheva K, Tablan V: GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02): July 2002; Philadelphia.
- GATE-General Architecture for Text Engineering [http://gate.ac.uk/]
- The Globus Alliance [http://www.globus.org/]
- Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R: Advances in Knowledge Discovery and Data Mining. The MIT Press; 1996.
- Nahm U, Mooney R: Using Information Extraction to Aid the Discovery of Prediction Rules from Text. Proceedings of the 6th International Conference Knowledge Discovery and Data Mining (KDD-2000) Workshop on Text Mining: August 2000; Boston, Massachusetts.
- Bunescu RC, Mooney RJ: Extracting Relations from Text: From Word Sequences to Dependency Paths. In Text Mining and Natural Language Processing. Edited by: Kao A, Poteet S. Springer; 2007:29–44.
- Mooney R, Bunescu R: Mining Knowledge from Text Using Information Extraction. SigKDD Explorations special issue on Text Mining and Natural Language Processing 2005, 7(Suppl 1):3–10.
- Castellano M, Mastronardi G, Aprile A, Decataldo G, Dicensi V, Pisciotta L, Tarricone G: Knowledge Discovery in Biomedical Documents using Text Mining Approach: an Application to Named Entity Recognition. GESTS International Transaction on Computer Science and Engineering 2008, 45(Suppl1):9–20.
- Carvalho PC, Glória RV, de Miranda AB, Degrave WM: Squid – a simple bioinformatics grid. BMC Bioinformatics 2005, 6: 197.
- Hirmer S, Kaiser H, Merzky A, Hutanu A, Allen G: Generic support for bulk operations in grid applications. Proceedings of the 4th international workshop on Middleware for grid computing: 2006; Melbourne, Australia.
- Castellano M, Mastronardi G, Decataldo G, Pisciotta L, Tarricone G, Cariello L, Bevilacqua V: Biomedical Text Mining Using a Grid Computing Approach. In LNCS Advanced Intelligent Computing Theories and Applications. With Aspects of Artificial Intelligence. Volume 522. Springer Berlin/Heidelberg; 2008:1077–1084.
- StandAloneAnnie.java file [http://gate.ac.uk/gate-examples/doc/java2html/sheffield/examples/StandAloneAnnie.java.html]
- Witten IH, Frank E: Data Mining: Practical machine learning tools and techniques. 2nd edition. Morgan Kaufmann: San Francisco; 2005.
- Weka Machine Learning Project [http://www.cs.waikato.ac.nz/ml/weka/]
- Talia D, Trunfio P, Verta O: Weka4WS: a WSRF-enabled Weka Toolkit for Distributed Data Mining on Grids. In Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2005): October 2005; Porto, Portugal. Springer-Verlag: LNAI 3721; 2005:309–320.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.