EnzyMiner: automatic identification of protein level mutations and their impact on target enzymes from PubMed abstracts
© Yeniterzi and Sezerman. 2009
Published: 27 August 2009
A better understanding of the mechanisms of an enzyme's functionality and stability, as well as knowledge of mutations and their impact, is crucial for researchers working with enzymes. Though several enzyme databases are currently available, the scientific literature remains the most up-to-date source for learning the effects of a mutation on an enzyme. However, going through vast amounts of scientific documents to extract information on a desired mutation has always been a time-consuming process. In this paper, therefore, we describe a unique method, termed EnzyMiner, which automatically identifies the PubMed abstracts that contain information on the impact of a protein level mutation on the stability and/or the activity of a given enzyme.
We present an automated system which identifies the abstracts that contain an amino-acid-level mutation and then classifies them according to the mutation's effect on the enzyme. For mutation identification, MuGeX, an automated mutation-gene extraction system, has an accuracy of 93.1% with a 91.5 F-measure. For impact analysis, document classification is performed to identify the abstracts that describe a change in the enzyme's stability or activity resulting from the mutation. The system was trained on lipases and tested on amylases with an accuracy of 85%.
EnzyMiner identifies the abstracts that contain a protein mutation for a given enzyme and, with the help of information extraction and machine learning techniques, checks whether each abstract is related to a disease. For disease related abstracts, the mutation list and direct links to the abstracts are retrieved from the system and displayed on the Web. Non-disease related abstracts are, in addition to being listed with their mutations, categorized into two groups according to whether the mutation affects the enzyme's stability or its functionality, and these groups are likewise displayed on the Web.
Enzymes are mostly protein based biomolecules that accelerate chemical reactions in a living organism. Enzymes are made of amino acids whose characteristic composition gives them different functionalities and lets them work efficiently under stable conditions, such as optimum temperature and pH. Thus, any mutation that occurs in this amino acid sequence may change the enzyme's 3D structure, catalytic activity or stability, or even make the enzyme completely non-functional. Therefore, knowledge of mutations and their impacts is of crucial importance for completely understanding the mechanisms of enzymes' functionality and stability.
Many experimental studies have focused on finding the effects of mutations on enzymes. For example, many studies have aimed to create enzymes with novel properties, such as designing hyperthermophilic enzymes or expanding the substrate specificity of an enzyme. The published results of these projects provide scientific information for researchers who seek the impact of a mutation on an enzyme. Though many databases are available on the nomenclature of enzymes [3, 4] or on their structure and function [5–11], to our knowledge only BRENDA (BRaunschweig ENzyme Database) [12, 13], the largest manually curated enzyme-specific information system, contains information on engineered enzymes and the effects on their catalytic activity while directly referring to the scientific literature. Manual curation, however, is both a slow and an expensive way of extracting information from scientific literature. There is a need for an efficient automatic extraction method that gives rapid access to the relevant information at any time.
With the latest developments in information extraction, biomedical term recognition has become an important area for researchers. Dictionary-based, rule-based, and machine learning-based approaches are used to extract names of genes, proteins and other cellular substances [14–16]. Several systems have already been developed for the automatic extraction of mutations from biomedical literature. MuteXt, for example, developed by Horn et al., is one of the initial works that focused on extracting single point mutations from scientific literature. Moreover, a gold standard data set has been created for comparing the performance of mutation extraction systems, and systematic evaluations [18, 19] of these systems have been developed with a precise definition of evaluation metrics. The next step in mutation informatics is finding the relation of mutations to other biological entities such as genes or proteins.
Rebholz-Schuhmann et al. developed MEMA, which extracts disease-related mutation-gene pairs from Medline abstracts. In MEMA, the identification of both gene names and mutations is based on regular expressions compiled into two different finite state automata. If the abstract or the sentence from which the mutation is extracted contains only one gene name, the detected mutation is associated with that particular gene. However, if there is more than one gene name, MEMA uses syntactical rules and proximity parameters as decision criteria. MuGeX (Mutation Gene eXtractor), developed by Erdogmus et al., uses a similar approach to extract disease related protein mutations. MuGeX makes use of regular expressions to identify possible mutations, but it also handles ambiguous mutation citations by using machine learning techniques. For gene name identification, it uses a dictionary-based approach and then associates the extracted entities according to proximity measures.
For mutation-protein associations, Lee et al. developed Mutation GraB, which identifies mutations using regular expressions, similar to the previous methods [20, 21]. Protein identification is also performed with regular expressions, which search for entries from a dictionary of protein names and synonyms. Lastly, Mutation GraB uses graphs in which shortest-distance search and word bigram analysis are used to find the associations between mutations and proteins. MutationMiner, developed by Baker et al. [23, 24], follows a different approach from the previous systems. It mainly focuses on associations between mutations and protein structure visualizations using NLP techniques. The system identifies proteins and mutations as named entities and, if they are cited in the same sentence, associates them with one another. Moreover, this system has recently been improved with the support of biological ontologies, which make mutation annotations available in a semantically consistent format, and with an OWL ontology, which makes automated access to the knowledge possible [25, 26].
The above information extraction techniques became necessary because of the increased number of electronic documents. At the same time, the need to classify these documents based on their contents has made document classification an important field for researchers. Especially since machine learning techniques were integrated into document classification, its accuracy has become comparable to that of human experts, which is itself less than 100% [27, 28]. Therefore, because of growing interest and high accuracy rates, document classification has been used in different applications such as document organization [27, 29], word sense disambiguation or web document classification [31, 32].
Although the above works make it possible to associate mutations with other biological entities, the experimental results documented in scientific literature cannot be extracted with the techniques discussed above. Therefore, in mutation informatics, the next step is to extract information describing the effects of the mutations [17, 19, 33]. We developed EnzyMiner, which is capable of automatically extracting protein mutations from PubMed abstracts for a given enzyme and classifying their impact on the enzyme's functionality and stability. For mutation identification, information extraction and document classification methods are used. For impact analysis, document classification techniques are again used to identify the abstracts that describe a change in the stability conditions or catalytic activity of an enzyme resulting from a mutation.
In this work, we focus on amino acid level protein mutations. In order to extract this kind of mutation from PubMed abstracts, we used MuGeX's Mutation Extraction and Disambiguation Modules, which are based on regular expressions and machine learning techniques. In MuGeX, a set of 20 patterns for regular expression matching was formed from phrases that contain protein mutations. Pattern matching was then applied to each sentence of the abstracts in order to identify mutations.
One major drawback of using regular expressions is that they are too general to capture only the protein level mutations. A nucleotide mutation such as G32A, or the name of a strain or a cell line such as H4S, may easily be misinterpreted as a protein mutation. In order to eliminate such ambiguities, MuGeX makes use of machine learning techniques. Document classification applied to the topic and content sections of these abstracts determines whether or not a protein mutation is present.
Enzymes are essential for a number of functions in the cell; therefore, any malfunctioning of an enzyme caused by a mutation may be related to a disease, and many research projects are conducted to find the relation between enzyme mutations and diseases. Since MEDLINE is the largest component of PubMed, the abstracts of these medical projects are included in the PubMed database. Therefore, when we download all the abstracts that contain the term "mutation" and a specific enzyme name, the medical abstracts are downloaded as well. However, these abstracts generally focus on the mutations' effects on disease development, not on the functionality or the stability of enzymes. For instance, lipoprotein lipase hydrolyzes lipids in lipoproteins. Mutations in the gene encoding lipoprotein lipase can lead to lipoprotein lipase deficiency, which in turn leads to an increase in the levels of triglycerides in the bloodstream. Therefore, many abstracts that contain both the words "mutation" and "lipase" are about research aiming to find the specific mutations that cause this disease. Such disease related abstracts do not contain information on the direct impact of the mutation on the enzyme's functionality or stability, so including them in our classification would distort our results. To prevent this, we initially classify the abstracts into two groups: (a) disease related abstracts and (b) non-disease related abstracts. Dictionary based approaches that make use of Medical Subject Headings (MeSH) or the Unified Medical Language System (UMLS) have previously been used for identifying disease related abstracts. In addition, various document classification algorithms have been employed for clinical text classification of disease related documents such as clinical patient records.
In EnzyMiner, we used a document classification approach for grouping our abstracts into disease related and non-disease related abstracts. The user is thus given the option of choosing disease related or non-disease related abstracts, and based on this selection a list of mutations and their impacts on the enzyme is displayed on the Web with direct references to the relevant scientific literature.
Enzymatic reaction rates depend on physico-chemical conditions, such as pH and temperature. When these conditions are optimum, the enzyme attains its maximal activity. However, at high temperatures or extreme pH values, the enzyme may become inactive as a result of denaturation. Stability is the condition in which an enzyme can maintain its structural conformation and activity, yet a mutation may cause changes in these conditions. For instance, a mutation that decreases the molecular flexibility of an enzyme may sometimes lead to higher thermostability, and a mutation that replaces glycine with a basic or acidic residue may cause significant changes in the enzyme's optimum pH. Therefore, to understand the governing rules of protein structure stability, researchers carry out mutation studies and check the impact on stability at different temperatures and pH values.
Likewise, an enzyme's functionality is also affected by mutations. Unlike most chemical catalysts, enzymes are highly selective for their substrates, and this specificity is mainly determined by the three dimensional coordinates of the active site. Therefore, a mutation that modifies the conformation of an enzyme, especially of the catalytic site, may abolish its function. A mutation that changes the spatial coordinates of the active site can change the enzyme's specificity for a particular molecule. On the other hand, not all mutations necessarily have a significant effect on the enzyme's stability conditions or catalytic functions. There are experimental results, for example, which show that some mutations have no impact. The abstracts that include only these kinds of mutations have to be filtered out, since our main purpose is to find the abstracts that describe impacts of mutations on enzymes. We represented this challenge as a classification problem and removed the abstracts that are classified as no change abstracts from our data set.
In the Preprocessing Stage, the downloaded PubMed abstracts are first parsed; tokenization and then sentence splitting are performed on the parsed abstracts to prepare them for the subsequent stages.
For the mutation extraction stage, we used MuGeX, which was developed by our group to extract disease related mutation-gene pairs. MuGeX consists of two stages and each stage has several modules. Since this work requires us to identify only protein mutations, we used the Mutation Extraction and Disambiguation Modules of MuGeX. The Mutation Extraction Module makes use of pattern matching with regular expressions to find the possible protein mutations in the text. All of the regular expressions are based on one pattern that starts with an amino acid in one letter code, followed by a number, and ends with another amino acid in one letter code (e.g. W16A). By modifying this pattern, a set of 20 patterns was formed in order to find the remaining mutation patterns.
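As an illustration of the base pattern only, the following Python sketch matches a one letter amino acid code, a position, and a second one letter code. It is not the MuGeX implementation and does not reproduce the other 19 patterns; the names `AA1`, `POINT_MUTATION` and `find_candidate_mutations`, the use of word boundaries, and the example sentence are assumptions made for this sketch.

```python
import re

# One letter codes of the 20 standard amino acids.
AA1 = "ACDEFGHIKLMNPQRSTVWY"

# Base pattern: wild-type residue, position, mutant residue (e.g. W16A).
# The word boundaries are an assumption of this sketch; they stop the
# pattern from matching inside longer alphanumeric tokens.
POINT_MUTATION = re.compile(rf"\b([{AA1}])([1-9][0-9]*)([{AA1}])\b")


def find_candidate_mutations(sentence):
    """Return (wild_type, position, mutant) tuples found in a sentence."""
    return [(m.group(1), int(m.group(2)), m.group(3))
            for m in POINT_MUTATION.finditer(sentence)]


if __name__ == "__main__":
    # Hypothetical sentence written for this example.
    text = "The W16A variant lost activity, whereas strain H4S grew normally."
    print(find_candidate_mutations(text))
```

On the example sentence the sketch returns both W16A and the strain name H4S, since A, C, G and T are also valid amino acid codes and strain names can have the same shape. A purely pattern-based step therefore only yields candidates, which is why a disambiguation step is needed afterwards.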
Regular expressions are able to identify protein mutations. However, as indicated previously, the main challenge in amino-acid-level mutation extraction is to distinguish actual protein mutations from nucleotide mutations or from mutation-like terms such as the name of a strain or a cell line. The Disambiguation Module of MuGeX eliminates these kinds of ambiguities by using document classification as a word sense disambiguation application. This module uses Rainbow, one of the front ends of the Bow library designed for document classification, to classify the abstracts that contain potentially ambiguous mutations. A training benchmark that consists of 3,600 randomly selected Medline abstracts was formed, and all of these abstracts were labeled by experts. Of these, a total of 2,771 abstracts contained protein mutations, while the remaining abstracts contained either nucleotide mutations or biological entities that are cited with mutation-like terms. Using these labeled abstracts, the Disambiguation Module trains a classification model, processing the abstracts with stemming, alphanumeric tokenization and word bigrams.
After the Preprocessing Stage, EnzyMiner identifies all possible protein mutations in the downloaded abstracts using the Mutation Extraction Module. The ambiguous abstracts are then passed to the Disambiguation Module, where they are classified with the Naive Bayes algorithm using the model trained as described above. At the end of this stage, the abstracts that contain protein mutations are passed to the classification modules of the Impact Analysis Stage for further classification.
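To make the classification step concrete, the sketch below shows a generic textbook multinomial Naive Bayes classifier with add-one smoothing. It is not Rainbow's implementation, and the function names, the smoothing choice, and the four toy training documents are assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count documents per class and token occurrences per class.

    labeled_docs is an iterable of (tokens, label) pairs.
    """
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def classify(model, tokens):
    """Return the label with the highest log posterior score."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for token in tokens:
            if token in vocabulary:  # tokens unseen in training are skipped
                count = word_counts[label][token]
                # Add-one (Laplace) smoothing over the vocabulary.
                score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Toy, made-up training documents; not data from the paper.
    docs = [
        (["residue", "substitution", "protein"], "protein_mutation"),
        (["amino", "acid", "substitution"], "protein_mutation"),
        (["nucleotide", "transition", "gene"], "other"),
        (["cell", "line", "strain"], "other"),
    ]
    model = train_naive_bayes(docs)
    print(classify(model, ["amino", "acid", "residue"]))
```

Working in log space avoids numerical underflow when many token probabilities are multiplied, which matters for abstracts with hundreds of tokens.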
In all the classification modules of the Impact Analysis Stage, Rainbow was used as the document classifier. It assigns a document to the class with the highest score, but the scores of the other classes are also given in the results. In the Disease vs. Non-Disease and Change vs. No Change Classification Modules, the abstracts in the test set are assigned to the class with the highest classification score. For the Stability vs. Catalytic Classification Module, on the other hand, the difference between the scores of the two classes was checked before assigning an abstract to one of the classes. If the difference was less than 0.05, the abstract was assigned to both classes, because an abstract may describe a change in both the stability and the catalytic activity. Furthermore, in the classification modules, all words from the vocabulary are used as features during classification.
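The score-difference rule for the Stability vs. Catalytic Classification Module can be sketched as follows. The function name `assign_classes`, the dictionary input format, and the example scores are assumptions made for this sketch; only the 0.05 threshold comes from the text.

```python
def assign_classes(scores, margin=0.05):
    """Assign the top-scoring class, plus the runner-up if it is close.

    scores maps class labels to classifier scores. The runner-up is added
    when its score is less than `margin` below the best score.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_label, best_score = ranked[0]
    labels = [best_label]
    if len(ranked) > 1 and best_score - ranked[1][1] < margin:
        labels.append(ranked[1][0])
    return labels


if __name__ == "__main__":
    # Made-up scores for illustration.
    print(assign_classes({"stability": 0.52, "catalytic": 0.48}))
    print(assign_classes({"stability": 0.90, "catalytic": 0.10}))
```

With the made-up scores above, the first abstract is assigned to both classes because the scores differ by less than 0.05, while the second is assigned to the stability class only.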
All the steps of the Preprocessing Stage were developed using Flex and Bison. In the Mutation Extraction Module, all regular expressions were compiled with the help of a C++ regular expression library. The Disambiguation Module makes use of Rainbow in order to classify the abstracts that contain potentially ambiguous mutations. Similarly, Rainbow was used as the document classifier in all the classification modules of the Impact Analysis Stage.
During the steps shown in Figure 1, all necessary information for the downloaded abstracts is inserted into a database. Whenever a user queries the system for the impacts of mutations on an enzyme's stability and functionality, the information relevant to that specific enzyme is retrieved from the system and displayed on the Web (Figures 2 and 3).
In order to measure the performance of EnzyMiner, the system was run with the query enzymes "lipase" and "amylase". These two enzymes perform essential roles in the human digestive system, and a deficiency of these enzymes may lead to disorders in the human body. Moreover, they have many applications in the chemical and food industries. Therefore, it is easy to find both disease related and non-disease related abstracts that mention one of the queried enzymes. As indicated before, possible variations of the term "mutation" were included in our search query while developing the EnzyMiner system. In the evaluation step, however, only the PubMed abstracts that contain the terms "mutation" or "mutations" and one of the queried enzymes were downloaded. As a result, our experimental dataset comprised 393 abstracts that contain the term "lipase" and 126 abstracts that contain the term "amylase". All of the abstracts were manually curated by experts for the three classification schemes.
The Impact Analysis Stage of EnzyMiner consists of 3 classification modules: (a) Disease vs. Non-Disease Classification, (b) Change vs. No Change Classification, and (c) Stability vs. Catalytic Classification Modules. Before detailing the experiments performed at this stage, stating the classification algorithms and evaluation metrics used in these modules will provide better understanding of the evaluation.
In the classification stages of EnzyMiner, Rainbow was used for document classification. It provides several classification algorithms and processing options. In order to decide which classification algorithm was most suitable, the classification performances of four algorithms (Naive Bayes, SVM, Probabilistic Indexing and Rocchio with TF-IDF weighting) were investigated. Moreover, since some of these algorithms adopt the bag-of-words approach, eight different processing models were built to observe the impact of tokenization, stemming, and the use of n-grams. In these eight models, the effect of stemming (removing commonly observed morphological and inflectional suffixes from words) and the effect of using unigrams and bigrams were tested. In addition, to observe the effect of tokenization, two types of tokenizer were used: a white space tokenizer and an alphanumeric tokenizer. The white space tokenizer delimits the tokens by white space only, while the alphanumeric tokenizer delimits the tokens by non-alphanumeric characters only.
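The difference between the two tokenizers and the construction of word bigrams can be illustrated with a small Python sketch. It does not reproduce Rainbow's own tokenization options; the function names are hypothetical, the alphanumeric tokenizer is restricted to ASCII letters and digits as a simplifying assumption, and stemming is omitted.

```python
import re


def whitespace_tokenize(text):
    """Split on runs of white space only; punctuation stays attached."""
    return text.split()


def alphanumeric_tokenize(text):
    """Split on every non-alphanumeric character (ASCII only, by assumption)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def bigrams(tokens):
    """Return adjacent token pairs joined by a single space."""
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


if __name__ == "__main__":
    # Hypothetical sentence written for this example.
    sentence = "The pH-optimum shifted (W16A)."
    print(whitespace_tokenize(sentence))
    print(alphanumeric_tokenize(sentence))
    print(bigrams(alphanumeric_tokenize(sentence)))
```

For the example sentence, the white space tokenizer keeps "pH-optimum" and "(W16A)." as single tokens, whereas the alphanumeric tokenizer yields "pH", "optimum" and "W16A". The choice therefore changes which features the bag-of-words classifier sees.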
As stated before, systematic evaluations of mutation extraction systems have been developed with a precise definition of evaluation metrics. Although few systems yet analyze the impact of mutations, evaluation metrics for such systems have also been defined. Since the general idea of impact analysis is to perform information extraction, the standard precision, recall and F-measure are suitable evaluation metrics. We also decided to use the accuracy measure in our evaluations because we approached this problem as a document classification problem.
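For reference, the four measures can be computed from the counts of a binary confusion matrix as in the sketch below. These are the standard definitions, with the F-measure taken as the balanced harmonic mean of precision and recall, which is an assumption since the text does not state the weighting. The function name and the example counts are made up for illustration.

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F-measure and accuracy from raw counts.

    tp, fp, fn, tn are the numbers of true positives, false positives,
    false negatives and true negatives. Undefined ratios are returned as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Made-up counts, not results from the paper.
    print(evaluation_metrics(tp=8, fp=2, fn=2, tn=8))
```

Accuracy counts true negatives while precision, recall and F-measure do not, so the two kinds of measure can diverge on unbalanced classes. This is one reason to report both.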
Furthermore, in order to test the effectiveness of the classification modules, we combined the train-and-test and k-fold cross validation approaches. For each classification module, 20% of the abstracts were chosen as a test set. The remaining abstracts were used as a training set, and the models were trained only with these abstracts. First, using only the training set, 3-fold cross validation was performed fifty times, and the average accuracy was reported as the training set accuracy. This step was repeated for all four classification algorithms and eight models. The combination of classification algorithm and model with the highest accuracy was used in the test step, where the abstracts from the test set were submitted for classification. The overall accuracy of this step was reported as the test set accuracy.
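The evaluation protocol can be outlined in Python as below. This is a sketch under stated assumptions: the splits are plain random shuffles (the text does not say whether they were stratified), the function names and seeds are hypothetical, and `evaluate` stands for any caller-supplied function that trains on the first list and returns the accuracy on the second.

```python
import random


def holdout_split(items, test_fraction=0.2, seed=0):
    """Shuffle the items and hold out a fraction of them as the test set."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)


def k_folds(items, k, rng):
    """Shuffle the items and deal them into k folds of near-equal size."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def repeated_cv_accuracy(train_items, evaluate, k=3, repeats=50, seed=0):
    """Average accuracy over `repeats` independent runs of k-fold validation."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        folds = k_folds(train_items, k, rng)
        for i, held_out in enumerate(folds):
            fit_on = [x for j, fold in enumerate(folds) if j != i for x in fold]
            accuracies.append(evaluate(fit_on, held_out))
    return sum(accuracies) / len(accuracies)
```

In this outline, `repeated_cv_accuracy` would be called once for each combination of algorithm and processing model on the training set only, and the best combination would then be trained on the full training set and scored once on the held-out test set.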
Table 1. Experimental results of Disease vs. Non-Disease Classification Module (column headers: Disease vs. Non-Disease Classification, # of Abstracts)
Table 2. Experimental results of Change vs. No Change Classification Module (column headers: Change vs. No Change Classification, # of Abstracts)
Table 3. Experimental results of Stability vs. Catalytic Classification Module (column headers: Stability vs. Catalytic Classification, # of Abstracts)
After the Disease vs. Non-Disease Classification, if the user specifically indicates an interest in disease related abstracts, the abstracts that are assigned to this class are displayed on the Web with a list of the protein mutations they contain. The mutation extraction performance on these abstracts was 92.2% accuracy with 95.2% precision, 86.8% recall and 90.7 F-measure. On the other hand, if the user is not interested in disease related abstracts, the abstracts that are assigned to the non-disease related class are processed further in the Change vs. No Change Classification Module.
In this module, another classification identifies the abstracts that do not describe any change in the enzyme's stability or functionality. As shown in Figure 5, the highest classification accuracy on the training set was obtained using Model 3 (white space tokenizer, stemming, unigram) with 91.3% accuracy and 95.4 F-measure; the results of this experiment are given in Table 2. Using this model, 91.7% accuracy and 95.5 F-measure were obtained in the classification of the test set. At the end of this step, the abstracts that were classified as no change are eliminated.
In the last step, the remaining abstracts are classified into two groups: abstracts that contain information on a change in the enzyme's stability, and abstracts that discuss changes in the enzyme's catalytic activity. When this classification was performed on the training set, the highest accuracy (94.4%, with 96.6 F-measure) was observed with Model 5 (alphanumeric tokenizer, stemming, unigram), and the test set was therefore classified using this model. All the abstracts in the test set were correctly classified. As illustrated in Table 3, the results on both the training and test sets are very high, which is expected since the classification model successfully identifies the most informative words, such as "thermostability", "pH", "temperature", "sensitivity" and "specificity", and stemming increases their frequency.
The classification modules respond to the different processing options in a broadly similar way. However, the highest training set accuracy was obtained with a different model for each classification module, which shows that processing options have different effects on the classification of different concepts. The only element common to these three models was stemming. This is an expected result, since stemming reduces the dimensionality of the term space, which benefits systems with small training sets such as EnzyMiner.
Experimental results on Amylases (column headers: # of Abstracts, # of Correctly Identified Abstracts; rows: Disease vs. Non-Disease Classification, Change vs. No Change Classification, Stability vs. Catalytic Classification)
In this paper, we described EnzyMiner, an automated system designed to identify the PubMed abstracts that contain information on the impact of a protein level mutation on the stability and the activity of a given enzyme. To our knowledge, besides the manually curated enzyme database BRENDA, there is no tool or database that provides the same information as EnzyMiner. Although we performed our experiments on lipases and amylases, EnzyMiner can be applied to other enzymes without any modification. For mutation extraction, we used MuGeX, which handles ambiguous mutation citations successfully and thus has a high accuracy and F-measure. For impact analysis of mutations, EnzyMiner uses document classification first to separate disease related abstracts, then to eliminate abstracts that do not describe any change in the enzyme's stability or catalytic activity, and lastly to classify the remaining abstracts according to the impact of the mutation on the stability or on the catalytic activity.
Currently, EnzyMiner does not give any information about the direction of the impact, that is, whether there is an increase or a decrease in the stability conditions or catalytic activity. Such a specification cannot be handled as a document classification problem because some abstracts contain several mutations, each affecting the stability condition or catalytic activity of an enzyme in a different direction. Moreover, EnzyMiner does not specify the kind of change in the stability conditions: it cannot distinguish between changes in pH and temperature values. Similarly, in the case of functionality, it cannot differentiate a change in the enzyme's specificity from a change in its sensitivity. Another limitation is that, if an abstract contains several enzymes and protein mutations, EnzyMiner cannot associate the enzymes with the corresponding mutations, since it does not use proximity measures. Therefore, in the future, natural language processing techniques should be employed in order to overcome these limitations and extract more specific information about the impact of a mutation on an enzyme.
The EnzyMiner system and the gold standard corpus for the Impact Analysis Stage are available under the URL http://bioapps.sabanciuniv.edu/enzyminer/.
The authors would like to thank Fırat Tahaoglu for programming the web interface, Barış Tümer for providing the technical support for our system, Aydın Albayrak for labeling the abstracts for three classification schemes, and Javed H. Niazi KM and Nancy Karabeyoglu for reviewing the manuscript.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 8, 2009: Proceedings of the European Conference on Computational Biology (ECCB) 2008 Workshop: Annotation, interpretation and management of mutations. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/10?issue=S8.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.