KID - an algorithm for fast and efficient text mining used to automatically generate a database containing kinetic information of enzymes
© Heinen et al; licensee BioMed Central Ltd. 2010
Received: 1 July 2009
Accepted: 13 July 2010
Published: 13 July 2010
The amount of available biological information is rapidly increasing and the focus of biological research has moved from single components to networks and even larger projects aiming at the analysis, modelling and simulation of biological networks as well as large scale comparison of cellular properties. It is therefore essential that biological knowledge is easily accessible. However, most information is contained in the written literature in an unstructured way, so that methods for the systematic extraction of knowledge directly from the primary literature have to be deployed.
Here we present a text mining algorithm for the extraction of kinetic information such as KM, Ki, kcat etc. as well as associated information such as enzyme names, EC numbers, ligands, organisms, localisations, pH and temperatures. Using this rule- and dictionary-based approach, it was possible to extract 514,394 kinetic parameters of 13 categories (KM, Ki, kcat, kcat/KM, Vmax, IC50, S0.5, Kd, Ka, t1/2, pI, nH, specific activity, Vmax/KM) from about 17 million PubMed abstracts and combine them with other data in the abstract.
A manual verification of approximately 1,000 randomly chosen results yielded a recall between 51% and 84% and a precision ranging from 55% to 96%, depending on the category searched.
The results were stored in a database and are available as "KID the KInetic Database" via the internet.
The presented algorithm delivers a considerable amount of information and may therefore help to accelerate research and the automated analysis required for today's systems biology approaches. The database obtained by analysing PubMed abstracts may be a valuable help in the field of chemical and biological kinetics. It is based entirely on text mining and therefore complements manually curated databases.
The database is available at http://kid.tu-bs.de. The source code of the algorithm is provided under the GNU General Public Licence and available on request from the author.
The availability of a number of different OMICS technologies has made it possible for whole "systems", from molecular networks via cells and organs to whole organisms, to become the focus of large-scale research projects in all biosciences, in addition to the traditional molecular biology methods. Whereas it is still possible to follow the literature manually in a limited area, the rapid growth of the scientific literature makes it impossible to, for example, extract the information on all enzymes of a given organism within a reasonable time, or to make large-scale comparisons between the metabolic functions of different organisms. Moreover, in drug development, knowledge of the binding properties between enzyme and ligand is essential.
Several databases provide information about enzymes and their characteristics, for example BRENDA [2–4] (currently 92,291 entries for KM, 32,484 for kcat, 21,833 for Ki and 33,372 for specific activity), Kinetikon, KMedDB, KDBI, DOQCS, SABIO-RK and IUPAC-kinetic. However, these databases are far from complete, which forces scientists into time-consuming manual extraction of values from the literature if a systematic research approach is followed.
One approach to faster and simpler access to this information is text mining [11–14], i.e. the computer-aided extraction of data from natural written text [1, 15–17]. Current algorithms include machine learning (e.g. Kinetikon), statistical (e.g. FRENDA and AMENDA), rule-based (KiPar and BioRAT) and mixed approaches (SUISEKI).
Here we present a rule- and dictionary-based [1, 15] text mining algorithm for the extraction of kinetic data, developed with a focus on short calculation time and high precision of the extracted information.
The extracted kinetic enzyme information is stored and presented in the database "KID the KInetic Database", which contains information extracted from about 17 million PubMed abstracts.
Construction and content
Table: Dictionaries used for identifying entities in the text and their size (columns: dictionary, number of entries).
Expressions for: KM, Ki, Kd, kcat, IC50, Vmax, nH, t1/2, kcat/KM, pI, Ka, S0.5, specific activity, Vmax/KM, pH, temperature
Units for: Vmax, specific activity, kcat/KM, t1/2, kcat, Vmax/KM, Ka
Since one term can only be mapped to one category (see below), ambiguous terms have to be assigned to one dictionary or excluded from the search. For example "IPP" is an acronym used for an enzyme (inositol-1,4-bisphosphate 1-phosphatase) as well as for a ligand (isopentenyl diphosphate).
Table: Extract from the dictionary for KM (column: synonym for KM). Example entry: affinity constant (km)
The 16,953,021 PubMed abstracts available in 2007 were analysed. Each abstract is split into sentences wherever a dot followed by whitespace is detected. Notable exceptions to this rule are recognized abbreviations like "i.v." or "e.g.". The sentences are then split into tokens at whitespace.
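The splitting rule can be illustrated with a minimal Python sketch. It is not the authors' C++ implementation; the abbreviation list and the function names `split_sentences` and `tokenize` are assumptions made for this example.

```python
import re

# Hypothetical abbreviation list; the text names only "i.v." and "e.g." as examples.
ABBREVIATIONS = {"i.v.", "e.g.", "i.e."}

def split_sentences(abstract):
    """Split at a dot followed by whitespace, unless the dot ends a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"\.\s+", abstract):
        candidate = abstract[start:match.start() + 1]
        words = candidate.split()
        last_word = words[-1].lower() if words else ""
        if last_word in ABBREVIATIONS:
            continue  # the dot belongs to an abbreviation, not a sentence end
        sentences.append(candidate.strip())
        start = match.end()
    tail = abstract[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

def tokenize(sentence):
    """Split a sentence into tokens at whitespace."""
    return sentence.split()

if __name__ == "__main__":
    text = "The KM was 2 mM, e.g. for glucose. The Ki was 5 uM."
    for sentence in split_sentences(text):
        print(tokenize(sentence))
```

Decimal numbers such as "2.5" are left intact because the dot is not followed by whitespace.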
The algorithm is divided into two parts, the identification of entities in the text (i.e. tagging) and a rule-based linkage of these units.
For the identification step, the tokens of a sentence are looked up one by one in the hash-based dictionary, starting from a token like "glucose" and proceeding with the following words. The longest sequence of tokens that carries a flag on its last token is accepted, and the combination is then marked as a phrase carrying the associated flag.
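A longest-match lookup of this kind might be sketched as follows in Python. The table layout (token prefixes as keys, with a category flag only on complete terms), the example dictionaries and the names `build_hash` and `tag` are assumptions for illustration, not the published implementation.

```python
def build_hash(dictionaries):
    """Map every token prefix of every term to a flag; only complete terms carry a category."""
    table = {}
    for category, terms in dictionaries.items():
        for term in terms:
            tokens = tuple(term.lower().split())
            for i in range(1, len(tokens)):
                table.setdefault(tokens[:i], None)  # prefix only, no flag yet
            table[tokens] = category  # flag sits on the last token of the term
    return table

def tag(tokens, table):
    """Return (phrase, category) pairs, always taking the longest flagged token sequence."""
    lowered = [t.lower() for t in tokens]
    phrases = []
    i = 0
    while i < len(lowered):
        best_end, best_flag = None, None
        j = i + 1
        while j <= len(lowered) and tuple(lowered[i:j]) in table:
            flag = table[tuple(lowered[i:j])]
            if flag is not None:
                best_end, best_flag = j, flag
            j += 1
        if best_end is None:
            i += 1  # no flagged sequence starts at this token
        else:
            phrases.append((" ".join(tokens[i:best_end]), best_flag))
            i = best_end
    return phrases

if __name__ == "__main__":
    # Hypothetical miniature dictionaries
    dictionaries = {
        "ligand": ["glucose", "glucose 6-phosphate"],
        "enzyme": ["glucose 6-phosphate dehydrogenase"],
        "kinetic": ["km"],
    }
    table = build_hash(dictionaries)
    sentence = "The Km of glucose 6-phosphate dehydrogenase for glucose 6-phosphate was measured"
    print(tag(sentence.split(), table))
```

Because each lookup is a single hash access, most tokens are visited once; only a failed extension makes the scan fall back to the last flagged position.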
Numbers are recognized in a second step using regular expressions, but only in sentences in which a kinetic expression has previously been found. If a number is followed by a kinetic unit, both are combined into a common phrase. If certain phrases, for example a unit which does not match the kinetic expression, are found directly after the number, the number is removed, which reduces the risk of incorrect linkage.
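As a rough illustration of this step, the following Python sketch combines numbers with matching units and drops numbers followed by a unit of another kinetic category. The unit lists, the number pattern and the function name `numbers_for` are assumptions; the actual dictionaries and regular expressions are not reproduced in the text.

```python
import re

# Hypothetical unit dictionaries, keyed by kinetic category.
UNITS = {
    "KM": {"mM", "uM", "nM", "M"},
    "kcat": {"s-1", "min-1"},
}
NUMBER = re.compile(r"^\d+(?:\.\d+)?$")

def numbers_for(tokens, kinetic):
    """Collect number phrases that are compatible with the given kinetic expression."""
    allowed = UNITS[kinetic]
    all_units = set().union(*UNITS.values())
    found = []
    for i, token in enumerate(tokens):
        if not NUMBER.match(token):
            continue
        following = tokens[i + 1].rstrip(".,;") if i + 1 < len(tokens) else ""
        if following in allowed:
            found.append(f"{token} {following}")  # number and unit become one phrase
        elif following in all_units:
            continue  # unit of a different kinetic category: remove the number
        else:
            found.append(token)  # bare number, kept as a candidate
    return found

if __name__ == "__main__":
    tokens = "The Km was 2.5 mM and kcat was 30 s-1 at pH 7".split()
    print(numbers_for(tokens, "KM"))
```

In this sketch the bare number after "pH" stays available for later linkage, while "30 s-1" is discarded when searching for a KM value.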
If a ligand follows a negation phrase, e.g. "in absence of ATP", the ligand is also removed.
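A simple version of this negation filter could look like the Python sketch below. The list of negation phrases and the restriction to single-token ligands are assumptions made to keep the example short.

```python
# Hypothetical negation phrases; the text gives only "in absence of" as an example.
NEGATIONS = [("in", "absence", "of"), ("in", "the", "absence", "of"), ("without",)]

def drop_negated(tokens, ligands):
    """Return the (single-token) ligands that do not directly follow a negation phrase."""
    lowered = [t.lower().strip(".,;") for t in tokens]
    kept = []
    for ligand in ligands:
        idx = lowered.index(ligand.lower())
        negated = any(
            tuple(lowered[max(0, idx - len(phrase)):idx]) == phrase
            for phrase in NEGATIONS
        )
        if not negated:
            kept.append(ligand)
    return kept

if __name__ == "__main__":
    tokens = "The Km for glucose was 2 mM in absence of ATP.".split()
    print(drop_negated(tokens, ["glucose", "ATP"]))
```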
Enumerations receive special treatment. An enumeration is present if more than one entity of the same category is found in one sentence, with the entities separated by commas and with "or" or "and" between the last two entities. These terms are mapped one after the other to the entities of the remaining categories.
In sentences which contain more than one enumeration, the collections with the largest number of terms are linked sequentially. For example, in the phrase "..results for enzyme a and enzyme b with ligand c and ligand d.." enzyme a will be linked to ligand c and enzyme b will be linked to ligand d. This linkage of enumerations is not applied to kinetic categories.
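The sequential linkage of enumerations can be sketched in Python as below. The input format (already tagged entities grouped by category, in sentence order) and the function name `link_enumerations` are assumptions for this illustration.

```python
def link_enumerations(entities):
    """entities: category -> list of tagged terms in sentence order.

    The categories holding the largest number of terms are paired position by position.
    """
    longest = max((len(terms) for terms in entities.values()), default=0)
    if longest < 2:
        return []  # no enumeration present
    categories = [c for c, terms in entities.items() if len(terms) == longest]
    return [{c: entities[c][k] for c in categories} for k in range(longest)]

if __name__ == "__main__":
    entities = {
        "enzyme": ["enzyme a", "enzyme b"],
        "ligand": ["ligand c", "ligand d"],
        "organism": ["rat"],
    }
    for link in link_enumerations(entities):
        print(link)
```

The single organism is not part of an enumeration here, so it would be left to the other linkage rules.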
If one of the categories cannot be filled by direct linkage, an indirect linkage takes place. It is checked whether one and only one entry of the corresponding category is present in the sentence and, if so, this entry is accepted. If the indirect search over the sentence is not successful, an indirect search over the abstract, followed by a search over the title, is carried out with the same mechanism. The indirect linkage is not performed for pH and temperature at the level of the title and the abstract, since the entities of these categories are numbers not marked by a unit and can therefore not be distinguished from numbers unrelated to the kinetic constant. During indirect linkage at the level of the sentence, the missing unit helps to distinguish the number from the one belonging to the kinetic category.
If an enzyme name is found without a corresponding EC number (or vice versa), it is checked whether this information can be added automatically by a query in BRENDA.
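The fallback order of the indirect linkage might be expressed as in the following Python sketch. The data layout and the function name `indirect_link` are hypothetical, and treating repeated mentions of the same entry as a single entry is an assumption of this example.

```python
def indirect_link(category, sentence_entities, abstract_entities, title_entities):
    """Try sentence, then abstract, then title; accept a level only if it holds exactly one entry.

    Each *_entities argument maps a category to the list of entries found at that level.
    """
    levels = [
        ("sentence", sentence_entities),
        ("abstract", abstract_entities),
        ("title", title_entities),
    ]
    for level, found in levels:
        if category in ("pH", "temperature") and level != "sentence":
            break  # unit-less numbers are only linked at the sentence level
        # Assumption: repeated mentions of the same entry count as one entry.
        candidates = set(found.get(category, []))
        if len(candidates) == 1:
            return candidates.pop(), level
    return None, None

if __name__ == "__main__":
    print(indirect_link(
        "organism",
        {"organism": []},
        {"organism": ["rat", "human"]},
        {"organism": ["rat"]},
    ))
```

In the usage example the abstract is ambiguous (two organisms), so the search continues to the title, where a single organism is found.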
Distribution of linkages
For enzyme names, most of the results are linked indirectly at the title level (20%). Indirect linkage at the level of the sentence and at the level of the abstract is carried out in 12% and 14% of the cases, respectively. A smaller share (6%) is linked directly.
For organisms, indirect linkage at the abstract level (27%) and at the level of the title (24%) is used most often. Approximately 10% are linked indirectly at the level of the sentence.
For the localisation, the linkage mainly takes place at the level of the title (21%) and the abstract (11%). A further 17% are linked indirectly at the sentence level.
The majority of the EC numbers are annotated automatically from BRENDA (17%); a small number is linked at the level of the abstract (3%).
For numerical values, direct assignment is applied most often, with a share of 29%. 12% of the values are extracted from listings, and in 7% of the cases the linkage takes place indirectly at the level of the sentence.
Comparatively few values are linked for pH and temperature: 1.3% and 1.6%, respectively, are linked indirectly at the sentence level.
Summarizing, the direct linkage is successful for linking numbers (29% of linked numbers) and ligands (18% of linked ligands), whereas the indirect linkage is used for the linkage of ligands (22% on the level of the same sentence), organism and localisation (24% and 21% on the level of the title and 28% and 11% on the level of the abstract, respectively) and enzyme names (22% on the level of the title).
Using the linkage it was possible to generate more than one result per abstract for 46.5% of the abstracts containing kinetic information.
Content of the database
Amount of extracted entities for the corresponding kinetic categories.
Comparison of the content from different databases providing kinetic information.
Evaluation of the database
Breaking down the precision into types of linkages results in an overall precision of 91%, 78%, 89% and 88% for direct linkage, indirect linkage on the sentence level, abstract level and title level, respectively (compare additional file 5). A notable discrepancy becomes apparent when examining the precision of ligands, which is 86% for direct linkage, but on average 62% for indirect linkage.
The algorithm was implemented in C++ together with the Qt4 framework. Kinetic information from 16,953,021 PubMed abstracts (2007) was extracted with a single-core application within about 18 hours using a computer with a 1.6 GHz AMD Turion X2 TL-52 processor. During this time, a maximum of approx. 300 MB RAM was occupied (mainly by the dictionaries).
Range and coverage of the extracted data
Most often extracted organisms in all kinetic categories.
Online access to the database
Comparison of the algorithm
Since the algorithm is dictionary-based, the quality of the identification is limited by the number and quality of the entries in the dictionaries. Converting the dictionary entries to lower case avoids misses caused by inconsistent capitalisation, at the cost of increased ambiguity. Considering e.g. the extraction of ligands, where about 20% are falsely linked (see figure 6), more entries in the dictionary will not lead to better results, since a wrong entry has already been linked. Removing entries that cause false positives from the dictionary would, by contrast, reduce recall.
However, this kind of algorithm has certain advantages compared to others, e.g. those utilizing machine learning. The latter are mainly based on Hidden Markov Models [15, 17] using a reading horizon of e.g. three tokens, i.e. a linkage of entities can only be recognized if these are contained within the first three tokens, which hinders the recognition of long-range relations.
The published algorithms for the development of FRENDA and AMENDA are mainly based on co-occurrences. This concept is based on a statistical significance for linked entities, i.e. pairs of entities often found in direct neighbourhood or within a certain range are linked together. Hence, in contrast to the algorithm in this paper, it is for example not possible to gather numbers, because of their varying nature.
For the linkage of numbers, distinct and explicit values such as "2 molar" are necessary. Indirectly mentioned values like "... a value that is 50 times higher than the KM for this substrate." will therefore not be recognized. However, we are not aware of an algorithm that is capable of gathering such numerical values from sentences.
Characteristics of the database
The comparison with the PubMed IDs manually extracted in BRENDA shows an overlap with KID of 565 IDs for IC50, 9,055 for KM, 2,219 for Ki and 2,694 for kcat; i.e. about half of the IDs covered in BRENDA are also contained in KID. The ratio of kinetic constant entries to PubMed IDs is higher in BRENDA (about 4 to 1, compared to 2 to 1 in KID), which is to a certain extent caused by the fact that whole articles instead of abstracts are evaluated.
A further comparison of 100 randomly chosen abstracts which are contained in KID but not in BRENDA reveals that in 70 cases information was extracted correctly and is not available in BRENDA (see additional file 6). 27 abstracts contain information which was correctly recognized, but is not within the scope of BRENDA, since e.g. a tissue instead of an explicit enzyme is mentioned. The remaining 3 abstracts were false positives, e.g. caused by the use of KM as an abbreviation for the "Krushinsky-Molodkina" strain of rats in PubMed ID 6538738.
Examining 100 randomly chosen abstracts which are covered by BRENDA but not by KID shows that a kinetic expression missing in the abstract is the main reason (61 times; see additional file 7). In 35 cases there is no abstract available for the given PubMed ID, and in 4 cases the given expression was not contained in the dictionary of KM expressions. Hence extending the algorithm to use whole articles instead of abstracts might improve its performance.
A minimum of useful information in terms of enzyme kinetics is available if e.g. a kinetic expression and its numerical value can be linked to an enzyme (up to 133,774 times or 26% of the total number of database entries; see figure 4) or a ligand (up to 169,490 times or 33%). More information is attained by linking both categories together (up to 91,870 times or 18%).
The developed algorithm achieves a very low calculation time compared to other text mining algorithms, about 4 milliseconds per kilobyte of text when acting on PubMed abstracts. Similar algorithms such as BioRAT and SUISEKI require 3 to 5 seconds and 0.2 to 0.3 seconds per abstract, respectively, which would lead to a calculated processing time of approx. 600 to 1,000 days and 40 to 60 days, respectively, for the 16,953,021 abstracts. The speed of the algorithm results from the hash-based structure of the dictionary used during identification, which in most cases ensures that each word needs to be treated only once, except when the algorithm is forced to fall back because no kinetic flag is found at the last token (compare figure 1). Since the identification of numbers via regular expressions is only applied when a kinetic expression is detected, the associated increase in time of about one order of magnitude can be neglected, given that this procedure is applied in approx. 3% of the sentences.
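The extrapolated processing times follow from a simple multiplication, reproduced here as a short Python check. Only the per-abstract times and the number of abstracts are taken from the text; the function name `days` is chosen for this example.

```python
ABSTRACTS = 16_953_021
SECONDS_PER_DAY = 24 * 60 * 60

def days(seconds_per_abstract):
    """Total processing time in days for all abstracts at a given per-abstract time."""
    return ABSTRACTS * seconds_per_abstract / SECONDS_PER_DAY

if __name__ == "__main__":
    per_abstract_times = {"BioRAT": (3, 5), "SUISEKI": (0.2, 0.3)}
    for tool, (low, high) in per_abstract_times.items():
        print(f"{tool}: {days(low):.0f} to {days(high):.0f} days")
```

This gives roughly 589 to 981 days for BioRAT and 39 to 59 days for SUISEKI, which matches the rounded figures quoted above.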
Range and coverage of the extracted data
The classification of KM, Ki and Kd by their numerical values gives a clear and unbiased indication of the preferential range for each of these values, which in the case of KM and Kd fits into the expected range [28–30].
The classification of organisms contained within the results exhibits a clear majority for animals and bacteria, while plants are represented in smaller amounts. The distribution of the most frequently extracted single organisms over all categories indicates that human and rat are the "hot spots" of scientific research.
The short overall calculation time of the KID text mining algorithm and the resulting database provide evidence that the presented algorithm can be a helpful tool for the annotation and collection of data for other databases like BRENDA.
"KID the KInetic Database" is a valuable help in the field of chemical and biological kinetics. The extent of the data generated by a comprehensive text mining algorithm is comparable to that of databases with manually collected content, and the data are of reasonable quality, as indicated by precision and recall. Its major task is to accelerate research by providing the scientist with a large amount of data via its easily searchable web service, so that there is less need to consult the written literature.
The approach described here would also be usable for the interpretation of whole publications, not just abstracts (with the exception of tables, which require a separate interpretation). Furthermore, information about enzymes is not restricted to their kinetic character, and an extension to further search categories is conceivable in order to obtain even more data about an enzyme.
Availability and requirements
The extracted information is available via the free web service named "KID the KInetic Database" at http://kid.tu-bs.de. The implementation of the algorithm is available on request from the authors.
The German Federal Ministry of Education and Research (BMBF) is thanked for financing this research.
- Narayanasamy V, Mukhopadhyay S, Palakal M, Potter DA: TransMiner: mining transitive associations among biological objects from text. J Biomed Sci 2004, 11: 864–873. doi:10.1007/BF02254372
- Barthelmes J, Ebeling C, Chang A, Schomburg I, Schomburg D: BRENDA, AMENDA and FRENDA: the enzyme information system in 2007. Nucleic Acids Res 2007, 35: D511-D514. doi:10.1093/nar/gkl972
- Schomburg I, Chang A, Ebeling C, Gremse M, Heldt C, Huhn G, Schomburg D: BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Res 2004, 32: D431-D433. doi:10.1093/nar/gkh081
- Schmeier S: Automated recognition and extraction of entities related to enzyme kinetics from text. In Master thesis. Department of mathematics and computer science, University of Berlin, Bioinformatics program; 2005.
- Rojas I, Golebiewski M, Kania R, Krebs O, Mir S, Weidemann A, Wittig U: SABIO-RK: a database for biochemical reactions and their kinetics. BMC Systems Biology 2007, 1: s6. doi:10.1186/1752-0509-1-S1-S6
- Zhou W, Smalheiser NR, Yu C: A tutorial on information retrieval: basic terms and concepts. J Biomed Discov Collab 2006, 1: s2. doi:10.1186/1747-5333-1-2
- Hakenberg J, Schmeier S, Kowald A, Klipp E, Leser U: Finding kinetic parameters using text mining. OMICS 2004, 8: 131–152. doi:10.1089/1536231041388366
- Feldman R, Sanger J: The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge: Cambridge University Press; 2006.
- Ananiadou S, Kell DB, Tsujii Ji: Text mining and its potential applications in systems biology. Trends Biotechnol 2006, 24: 571–579. doi:10.1016/j.tibtech.2006.10.002
- Kao A, Poteet SR: Natural Language Processing and Text Mining. Berlin: Springer; 2006.
- Jensen LJ, Saric J, Bork P: Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet 2006, 7: 119–129. doi:10.1038/nrg1768
- Ananiadou S, McNaught J: Text Mining for Biology and Biomedicine. Norwood: Artech House, Inc; 2006.
- Spasic I, Simeonidis E, Messiha HL, Paton NW, Kell DB: KiPar, a tool for systematic information retrieval regarding parameters for kinetic modelling of yeast metabolic pathways. Bioinformatics 2009, 25: 1404–1411. doi:10.1093/bioinformatics/btp175
- Corney DPA, Buxton BF, Langdon WB, Jones DT: BioRAT: extracting biological information from full-length papers. Bioinformatics 2004, 20: 3206–3213. doi:10.1093/bioinformatics/bth386
- Blaschke C, Valencia A: The Potential Use of SUISEKI as a Protein Interaction Discovery Tool. Genome Informatics 2001, 12: 123–134.
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Geer LY, Helmberg W, Kapustin Y, Kenton DL, Khovayko O, Lipman DJ, Madden TL, Maglott DR, Ostell J, Pruitt KD, Schuler GD, Schriml LM, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Suzek TO, Tatusov R, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2006, 34: D173-D180. doi:10.1093/nar/gkj158
- Matsumoto M, Nishimura T: Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudo-Random Number Generator. ACM Transactions on Modeling and Computer Simulation 1998, 8: 3–30. doi:10.1145/272991.272995
- Nokia: Qt Software. [http://www.qtsoftware.com/]
- Joomla CMS [http://www.joomla.org/]
- Larrañaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano JA, Armañanzas R, Santafé G, Pérez A, Robles V: Machine learning in bioinformatics. Brief Bioinform 2006, 7: 86–112. doi:10.1093/bib/bbk007
- Mauck L, Colman RF: Alkylation of cysteinyl residues of pig heart NAD-specific isocitrate dehydrogenase by iodoacetate. Biochim Biophys Acta 1976, 429: 301–315.
- Bisswanger H: Multiple Equilibria. In Enzyme Kinetics. 1st edition. Weinheim: Wiley-VCH; 2002:5–49.
- Stryer L: Enzymes: Basic Concepts and Kinetics. In Biochemistry. 4th edition. New York: W.H. Freeman & Company; 1995:194–195.
- Kandel ER, Schwartz JH, Jessell TM: Ion Channels. In Essentials of Neural Science and Behavior. 1st edition. Norwalk, CT: McGraw-Hill, Appleton & Lange; 1996:115–132.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.