Workflow and auxiliary means
The DRENDA work flow is based on the disease mining procedures already included in BRENDA [13], but it is substantially enhanced, inter alia by the added classification feature. The work flow (Figure 1) is implemented in Python and uses MySQL as the database back-end. The first step is the initial sentence splitting (Figure 1, Sentence splitting) and the search for co-occurrences of enzyme and disease terms (Figure 1, Co-occurrence matching) within the title or an abstract sentence of the references listed in PubMed. The BRENDA enzyme names and synonyms, derived from manual literature annotation, currently comprise around 100,000 terms. This collection is the largest available reference dictionary for the identification of enzyme entities (Figure 1, BRENDA).
The terms for disease recognition are taken from the Medical Subject Headings (MeSH) [14]. The selected disease terms are a subset containing about 22,000 disease names and synonyms (Figure 1, MeSH). The titles and sentences in which at least one co-occurring enzyme and disease entity could be found (Figure 1, Result set co-occurrence) form the basis of the DRENDA content and are stored as a corpus for the subsequent classification.
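As an illustration of this first step, the following minimal sketch shows dictionary-based sentence splitting and co-occurrence matching. The function names, the naive regular-expression sentence splitter and the simple case-insensitive substring matching are assumptions made for this sketch; the actual DRENDA implementation works on the full BRENDA and MeSH dictionaries with a MySQL back-end.

```python
import re

def split_sentences(abstract):
    """Very simple sentence splitter; the actual work flow may use a more
    elaborate rule set or a dedicated tool."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', abstract) if s.strip()]

def find_terms(text, dictionary):
    """Return all dictionary terms (enzyme or disease names) found in the text.
    A plain case-insensitive substring match is used here for illustration."""
    lowered = text.lower()
    return {term for term in dictionary if term.lower() in lowered}

def co_occurrences(title, abstract, enzyme_dict, disease_dict):
    """Collect titles/sentences in which at least one enzyme term and
    at least one disease term co-occur."""
    hits = []
    for unit in [title] + split_sentences(abstract):
        enzymes = find_terms(unit, enzyme_dict)
        diseases = find_terms(unit, disease_dict)
        if enzymes and diseases:
            hits.append({"text": unit, "enzymes": enzymes, "diseases": diseases})
    return hits

# Toy dictionaries standing in for the BRENDA and MeSH term lists
enzyme_dict = {"NADPH oxidase", "gamma-glutamyltransferase"}
disease_dict = {"chronic granulomatous disease", "liver dysfunction"}
title = "Chronic granulomatous disease results from mutations of phagocyte NADPH oxidase."
print(co_occurrences(title, "", enzyme_dict, disease_dict))
```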
Relation classification
The classification is accomplished by machine learning methods. The basic classification unit is a sentence or title from a PubMed abstract in which at least one enzyme and one disease entity could be found (Figure 1, Result set co-occurrence). As these units contain only a small number of words, the problem is characterised by a sparse feature representation and can be addressed with a Support Vector Machine (SVM), a method that has proven to perform well on text classification tasks [15]. We used SVMlight [16] as the SVM implementation. In order to use an SVM for classification, the availability of training and test material is an essential requirement. For this purpose a corpus was composed (Figure 1, Test corpus; Train corpus), which contains 5,031 sentences and titles derived from PubMed abstracts. The selection from all sentences and titles was made according to the following conditions: 2,500 were randomly selected from those in which at least one enzyme entity was identified; 2,000 were randomly selected from those in which at least one enzyme entity and one disease entity were identified; about 500 were randomly collected from PubMed abstracts and titles without any precondition. This corpus was manually annotated by an expert with a biomedical background to identify the presence of enzyme and disease entities and to assign the categories. If both entities were present within one sentence or title, the relation was manually classified into the four categories causal interaction, therapeutic application, diagnostic usage and ongoing research. In addition to its deployment as training and test data, the corpus is also used to evaluate the co-occurrence matches (Figure 1, Evaluation).
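A minimal sketch of this sampling scheme, assuming the three candidate pools are already available as Python lists; the function name, the fixed seed and the use of the random module are illustrative only and not part of the published work flow.

```python
import random

def compose_corpus(enzyme_only, enzyme_and_disease, all_sentences, seed=0):
    """Assemble the annotation corpus from the three sampling pools described
    above; pool contents are illustrative, the sizes follow the text."""
    rng = random.Random(seed)
    corpus = (
        rng.sample(enzyme_only, 2500)           # at least one enzyme entity
        + rng.sample(enzyme_and_disease, 2000)  # enzyme and disease entity
        + rng.sample(all_sentences, 500)        # no precondition
    )
    rng.shuffle(corpus)
    return corpus
```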
Classification categories
Causal interaction
The crucial role of enzymes in catalysing metabolic reactions in organisms implies that a malfunction, caused directly by a mutation in the coding gene or indirectly by the presence of inhibitors (e.g. side effects of drugs) or by the absence of required cofactors, most often induces pathological conditions. The category causal interaction comprises references which describe such a relation between an enzyme and a disease entity. The training/test corpus contains 1,382 sentences/titles which are annotated for the class causal interaction.
Example sentence for this category:
"Chronic granulomatous disease (CGD) results from mutations of phagocyte NADPH oxidase." [17]
Ongoing research
In a large number of publications a distinct interrelation between enzymatic function and the development or progression of a disease is presumed but not yet fully proven, and further investigation is needed. Such cases constitute the category ongoing research. 587 sentences of the training/test corpus are assigned to this category.
Example sentence for this category:
"The prognostic significance of epidermal growth factor receptor (EGFR) expression in lung cancer and, more importantly, its ability to predict response to anti-EGFR therapies, are currently subjects of active research." [18]
Diagnostic usage
In a clinical laboratory numerous parameters are examined to determine the presence of a pathological state or its severity. Many of these methods are based on the measurement of enzyme activities in the specimen.
Gamma-glutamyltransferase (EC 2.3.2.2), for instance, catalyses the transfer of a glutamyl residue from glutathione to an acceptor amino acid, and a change in its activity is used e.g. as an indicator of liver dysfunction [19], which may be caused by excessive alcohol consumption. If an author states that an examination of an enzyme, such as the measurement of its activity, a test for its presence or an assay of its characteristic functional parameters, is part of the diagnostic course of action, the reference is assigned to the DRENDA category diagnostic usage. In the annotated training/test corpus 477 sentences/titles belong to this category.
Example sentence for this category:
"Prostate-specific antigen (PSA) is the most clinically useful tumour marker available today for the diagnosis and management of prostate cancer." [20]
Therapeutic application
Given that enzymes play a major role in the development and progression of many diseases, they are also of relevance from the therapeutic point of view. On the one hand, enzymes are considered as drug targets; on the other hand, the enzyme itself may be the drug. If the enzyme is either a drug target or the drug agent for a mentioned disease, the corresponding reference is allocated to the category therapeutic application. The training/test corpus contains 366 sentences/titles corresponding to this category; about 62% of these sentences/titles describe cases where the enzyme is a drug target of the therapeutic intervention against the disease.
Example sentence for this category:
"Indinavir sulfate is a human immunodeficiency virus type 1 (HIV-1) protease inhibitor indicated for treatment of HIV infection and AIDS in adults." [21]
Preprocessing and representation
In order to form a suitable input for SVM classification, the set of titles and sentences that were found to contain a co-occurrence of an enzyme and a disease entity is prepared in a preprocessing step (Figure 1, Preprocessing).
First, terms that represent the entities "enzyme" and "disease" are either removed (removal) or replaced (replacement) by a generic term representing all enzyme or all disease terms, respectively. In the removal method all terms known to represent names and synonyms of enzymes or diseases are deleted without any substitution. In the replacement method the deleted names are replaced by a single generic term for all diseases and a single generic term for all enzymes. These two generic terms were not present in the sentences and titles of the result set co-occurrence before their insertion during the preprocessing step.
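A minimal sketch of both preprocessing variants; the placeholder tokens ENZYMEENTITY and DISEASEENTITY as well as the regular-expression substitution are assumptions made for this illustration, not necessarily the tokens or mechanism used in DRENDA.

```python
import re

def preprocess(sentence, enzyme_terms, disease_terms, method="replacement"):
    """Remove recognised enzyme/disease terms (removal) or replace them with
    generic placeholder tokens (replacement)."""
    def substitute(text, terms, placeholder):
        for term in sorted(terms, key=len, reverse=True):  # match longest terms first
            text = re.sub(re.escape(term), placeholder, text, flags=re.IGNORECASE)
        return text

    enzyme_token = "ENZYMEENTITY" if method == "replacement" else ""
    disease_token = "DISEASEENTITY" if method == "replacement" else ""
    sentence = substitute(sentence, enzyme_terms, enzyme_token)
    sentence = substitute(sentence, disease_terms, disease_token)
    return re.sub(r"\s+", " ", sentence).strip()  # collapse leftover whitespace

print(preprocess("Chronic granulomatous disease results from mutations of NADPH oxidase.",
                 {"NADPH oxidase"}, {"Chronic granulomatous disease"}))
```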
The titles and sentences are represented as feature vectors within a document space [22]. The feature vector representation of a sentence or title is obtained by calculating the term frequency-inverse document frequency (tf-idf). The term frequency $tf_{ij}$ is the number of occurrences of the term $t_i$ in the particular sentence or title $d_j$. The inverse document frequency $idf_i$ is calculated as

$$idf_i = \log \frac{D}{d_{t_i}}$$

where $D$ is the overall number of sentences and titles and $d_{t_i}$ is the number of sentences and titles containing $t_i$. The tf-idf weight is the product of $tf_{ij}$ and $idf_i$.
Finally, all coordinates are divided by the length of the feature vector. Thereby each feature vector is normalised to unit length, which neutralises the influence of the differing sentence lengths.
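A minimal sketch of the tf-idf computation and unit-length normalisation, assuming the preprocessed sentences/titles are available as lists of tokens; the natural logarithm and the plain dictionary representation of the feature vectors are choices made for this illustration.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Compute tf-idf feature vectors normalised to unit length.
    documents: list of token lists (preprocessed sentences/titles)."""
    D = len(documents)
    # document frequency: number of sentences/titles containing each term
    df = Counter(term for doc in documents for term in set(doc))
    idf = {term: math.log(D / df_t) for term, df_t in df.items()}

    vectors = []
    for doc in documents:
        tf = Counter(doc)                                   # raw term counts tf_ij
        vec = {term: tf_ij * idf[term] for term, tf_ij in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({term: v / norm for term, v in vec.items()})  # unit length
    return vectors

docs = [["enzymeentity", "causes", "diseaseentity"],
        ["diseaseentity", "is", "treated", "with", "enzymeentity", "inhibitors"]]
print(tfidf_vectors(docs)[0])
```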
Classification of the co-occurrence results
In the next step of the DRENDA work flow, a classification of the co-occurrence results is performed. The training/test corpus was randomly split into 4/5 training examples and 1/5 test examples. For every classification category the data set contains the same number of positively and negatively annotated sentences/titles. The overall number of examples differs between the classification categories because it depends on the number of positively annotated sentences/titles in each category: all sentences/titles that were positively annotated for one category are joined with the same number of sentences/titles that were negatively annotated for this category and that contain at least one co-occurrence of an enzyme and a disease entity.
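A minimal sketch of how such a balanced data set per category could be assembled and split; the 4/5 to 1/5 split follows the description above, while the helper name and the fixed random seed are illustrative.

```python
import random

def balanced_set(positives, negatives, seed=0):
    """Join all positively annotated sentences/titles of a category with an
    equal number of negatively annotated ones (each containing an
    enzyme/disease co-occurrence), then split into 4/5 training and 1/5 test."""
    rng = random.Random(seed)
    sampled_negatives = rng.sample(negatives, len(positives))
    examples = [(s, 1) for s in positives] + [(s, 0) for s in sampled_negatives]
    rng.shuffle(examples)
    cut = int(0.8 * len(examples))
    return examples[:cut], examples[cut:]   # training set, test set
```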
A classification model (Figure 1, SVM training) is calculated and used to perform the classification of the co-occurrence results (Figure 1, SVM classification). The classification results are finally distributed into a four-level confidence system, with classification precision and specificity descending from level 4 to level 1.
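The following sketch illustrates the training and classification step. scikit-learn's SVC is used here purely as a stand-in for SVMlight, and the mapping of SVM decision values to the four confidence levels is hypothetical; the actual thresholds used in DRENDA are not given here.

```python
from sklearn.svm import SVC

def train_and_classify(X_train, y_train, X_new):
    """X_train/X_new: tf-idf feature matrices from the preprocessing step,
    y_train: binary labels (1 = category assigned, 0 = not assigned)."""
    model = SVC(kernel="linear")       # stand-in for SVMlight
    model.fit(X_train, y_train)
    scores = model.decision_function(X_new)

    def confidence_level(score):
        # hypothetical cut-offs; the real DRENDA level boundaries differ
        if score > 1.5:
            return 4
        if score > 1.0:
            return 3
        if score > 0.5:
            return 2
        return 1

    # return predicted class membership and a confidence level per example
    return [(int(s > 0), confidence_level(s)) for s in scores]
```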
Cross validation
Before establishing the described DRENDA work flow (Figure 1), a five-fold cross-validation was performed to test the performance of the preprocessing methods (removal; replacement) and to choose appropriate parameters for SVMlight. The overall ability to classify and assign sentences to the predefined categories was validated. For every category, causal interaction, therapeutic application, diagnostic usage and ongoing research, the same number of positively and negatively annotated sentences/titles was chosen from the training/test corpus and split randomly into five sets of equal size. Each individual set served once as the test set while the other four formed the training set. Several calibrations of the SVMlight parameters were tested, as were all four default kernel functions implemented in SVMlight (linear, polynomial, radial basis, sigmoid). This led to 2,688 distinct parameter combinations processed for each preprocessing method, removal and replacement. Because of the processing time, the classification of the co-occurrence results in the DRENDA work flow is processed only once and not according to the cross-validation scheme.
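A minimal sketch of such a cross-validated parameter search, again using scikit-learn as a stand-in for SVMlight; the kernel list mirrors the four default kernels mentioned above, while the C and gamma value ranges are illustrative and do not reproduce the 2,688 actual parameter combinations.

```python
from itertools import product
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def grid_cross_validation(X, y, seed=0):
    """Five-fold cross-validation over a grid of kernels and parameters;
    returns mean accuracy and standard deviation per combination."""
    kernels = ["linear", "poly", "rbf", "sigmoid"]   # four default SVMlight kernels
    C_values = [0.1, 1, 10]                          # illustrative values only
    gamma_values = [0.01, 0.1, 1]                    # illustrative values only
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

    results = {}
    for kernel, C, gamma in product(kernels, C_values, gamma_values):
        model = SVC(kernel=kernel, C=C, gamma=gamma)
        scores = cross_val_score(model, X, y, cv=cv)   # accuracy on each fold
        results[(kernel, C, gamma)] = (scores.mean(), scores.std())
    return results
```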
Evaluation measures
In order to evaluate the correctness of the results derived from the co-occurrence based entity recognition (Figure 1, Result set co-occurrence) and the entity relation classification (Figure 1, Classified sentences and titles) steps of the DRENDA work flow, several commonly used measures, including precision, recall, accuracy, specificity, F1 score and Matthews correlation coefficient (MCC), were calculated based on the test sets formed from the manually annotated corpus (Figure 1, Test corpus). In the co-occurrence evaluation, the correctly detected presence of a co-occurring enzyme and disease entity is counted as a true positive (tp), and the correctly detected absence of one or both entities as a true negative (tn). If no co-occurring enzyme and disease entities are present, or one entity is missing, but a co-occurrence is erroneously reported, this is counted as a false positive (fp). Co-occurring entities which are present but not found by the routine are counted as false negatives (fn). In the classification evaluation, the correct assignment of a sentence to a classification category is assessed as tp, the incorrect assignment to a category as fp, the correctly omitted assignment to a category as tn and the incorrectly omitted assignment as fn. The evaluation measures are defined as:
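The display formulas themselves are not preserved in this excerpt; in terms of tp, tn, fp and fn the standard definitions of the listed measures are:

```latex
\begin{align*}
\text{precision} &= \frac{tp}{tp+fp} &
\text{recall} &= \frac{tp}{tp+fn} &
\text{accuracy} &= \frac{tp+tn}{tp+tn+fp+fn} \\[4pt]
\text{specificity} &= \frac{tn}{tn+fp} &
F_1 &= \frac{2\,\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}} &
\text{MCC} &= \frac{tp\cdot tn - fp\cdot fn}{\sqrt{(tp+fp)(tp+fn)(tn+fp)(tn+fn)}}
\end{align*}
```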
In the five-fold cross-validation the values are reported together with their standard deviation.
In order to evaluate the benefit of varying the decision threshold on the predictive performance of the classification models, receiver operating characteristic (ROC) curves were plotted for the results of the five-fold cross-validation. In these ROC curves the average recall, or true positive rate, is plotted against the false positive rate. The ROC plots and the calculation of the area under the curve (AUC) were performed with the R package ROCR [23].
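The actual plots were produced with ROCR in R; purely as an illustration in the Python setting of the rest of the work flow, the same quantities can be computed with scikit-learn and matplotlib, which are substitutes for, not the tools used in, the published analysis.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def plot_roc(y_true, decision_values, label="causal interaction"):
    """Plot a ROC curve (true positive rate vs. false positive rate) and
    report the area under the curve for one classification category."""
    fpr, tpr, _ = roc_curve(y_true, decision_values)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f"{label} (AUC = {roc_auc:.2f})")
    plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate (recall)")
    plt.legend()
    plt.show()
    return roc_auc
```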