Figure 1 | BMC Bioinformatics

Figure 1

From: A bioinformatics knowledge discovery in text application for grid computing

Knowledge discovery in text process. This figure shows the Knowledge Discovery in Text process, which is composed of the Text Refining and Text Mining phases. The former transforms a free-form text document into a chosen Intermediate Form, while the latter deduces patterns or knowledge from the Intermediate Form. The input to Text Refining is unstructured data such as plain text, or semi-structured data such as HTML pages. Text Refining consists of Tokenization, which splits a text document into a stream of words by removing all punctuation marks and replacing tabs and other non-text characters with single white spaces, and Filtering methods, which remove words such as articles, conjunctions, and prepositions from the documents. Lemmatization methods try to map verb forms to the infinitive and nouns to their singular form. Stemming methods attempt to build the basic forms of words, for example by stripping the plural 's' from nouns, the 'ing' from verbs, or other affixes. Additional linguistic pre-processing may be needed to enhance the available information about terms: N-gram identification extracts generic n-word sequences that do not necessarily correspond to an idiomatic use; Anaphora resolution identifies the relationship between a linguistic expression (the anaphor) and a preceding phrase, thus determining what the expression refers to; Part-of-speech (POS) tagging determines the part-of-speech tag (noun, verb, adjective, etc.) for each term; Text chunking groups adjacent words in a sentence; Word Sense Disambiguation (WSD) tries to resolve ambiguity in the meaning of single words or phrases; Parsing produces a full parse tree of a sentence (subject, object, etc.). The output of Text Refining can be stored in a database, an XML file, or another structured form, referred to as the Intermediate Form. Text Mining techniques are then applied to the Intermediate Form. The Text Mining phases are document clustering, document categorization, and pattern extraction.
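
The first Text Refining steps described in the caption (Tokenization, Filtering, Stemming) can be illustrated with a minimal Python sketch. It is not the implementation used in the article: the function names, the tiny stop-word list, the lowercasing step, and the length thresholds in the suffix-stripping rules are all assumptions chosen for illustration, and a real system would likely rely on an established stemmer and a fuller stop-word list.

```python
import re

# Hypothetical, deliberately tiny stop-word list (articles, conjunctions,
# prepositions); a real pipeline would use a much fuller one.
STOP_WORDS = {"a", "an", "the", "and", "or", "but",
              "of", "in", "on", "to", "from", "with"}


def tokenize(text):
    """Split text into a stream of words.

    Punctuation, tabs and other non-alphanumeric characters are replaced
    by a single space before splitting. Lowercasing is an added assumption
    so that stop-word matching is case-insensitive.
    """
    cleaned = re.sub(r"[^A-Za-z0-9]+", " ", text)
    return cleaned.lower().split()


def filter_tokens(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Naive suffix stripping: remove 'ing', otherwise a plural 's'.

    The length checks are arbitrary guards against over-stripping short
    words; this is far cruder than a real stemming algorithm.
    """
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def refine(text):
    """Run tokenization, filtering and stemming in sequence."""
    return [stem(t) for t in filter_tokens(tokenize(text))]


if __name__ == "__main__":
    print(refine("The proteins are binding\tto the receptors."))
```

The resulting list of stems is one very simple candidate for an Intermediate Form; the later steps named in the caption (POS tagging, chunking, parsing, and so on) would add further structure on top of it.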
