EnzymeDetector: an integrated enzyme function prediction tool and database

Background The ability to accurately predict enzymatic functions is an essential prerequisite for the interpretation of cellular functions, and the reconstruction and analysis of metabolic models. Several biological databases exist that provide such information. However, in many cases these databases provide partly different and inconsistent genome annotations. Description We analysed nine prokaryotic genomes and found about 70% inconsistencies in the enzyme predictions of the main annotation resources. Therefore, we implemented the annotation pipeline EnzymeDetector. This tool automatically compares and evaluates the assigned enzyme functions from the main annotation databases and supplements them with its own function prediction. This is based on a sequence similarity analysis, on manually created organism-specific enzyme information from BRENDA (Braunschweig Enzyme Database), and on sequence pattern searches. Conclusions EnzymeDetector provides a fast and comprehensive overview of the available enzyme function annotations for a genome of interest. The web interface allows the user to work with customisable weighting schemes and cut-offs for the different prediction methods. These customised quality criteria can easily be applied, and the resulting annotation can be downloaded. The summarised view of all used annotation sources provides up-to-date information. Annotation errors that occur in only one of the databases can be recognised (because of their low relevance score). The results are stored in a database and can be accessed at http://enzymedetector.tu-bs.de.


Background
A large number of online accessible biological databases provide genome annotations for a wide variety of organisms. Among the most frequently used resources are the RefSeq database from the National Center for Biotechnology Information (NCBI), the Kyoto Encyclopedia of Genes and Genomes (KEGG) [1][2][3], the PEDANT protein database [4][5][6], and the UniProtKB database [7]. In addition, specialised databases exist that focus on a specific group of organisms, for example the Pseudomonas Genome Database V2 [8] for Pseudomonas strains.
Hand-curated annotations are available only for well-investigated model organisms. To annotate the genomes of other organisms, the databases mainly use computational annotation tools, and the quality criteria these tools implement are not always specified. There are obvious inconsistencies between pathway databases [9], and other databases providing predicted information on enzyme functions [10]. This is partly due to the fact that the automated annotation of enzyme functions is still a challenging task [11]. Additionally, the annotations may have been computed at different times and may therefore be based on different states of knowledge. In addition to the uncertainties introduced by gene prediction, functional assignments often rely on dubious assignments arising either from errors made in manual annotations or from errors transferred in automatic function predictions [10]. This leads to a high degree of inconsistency in the predicted enzyme functions.
In addition to the mentioned main annotation hosts, a number of annotation tools are available, some of which give reliability scores and some of which integrate different sources. For example, PRIAM [12] predicts enzyme functions based on sets of sequence profiles that have been computed for the entries of the ENZYME database; it is an annotation source that may be integrated in a future version of EnzymeDetector. EFICAZ [13,14] is also based on residue patterns. It can be obtained as a stand-alone tool or accessed via a web interface. With EFICAZ it is possible to integrate annotation data from an external source, but only data from the KEGG database can be integrated.
Yang et al. [15] suggested an annotation confidence score based on sequence comparisons with some reference organisms. The tool presented by Chitale et al. [16] delivers an annotation and a corresponding reliability score. A serious drawback is that the user has to analyse the sequences one by one.
Within Apollo [17] and the UCSC Genome Browser database [18] it is possible to integrate annotation sources, but only with respect to the genomic positions of the genes and not on the available function predictions.
In order to easily access function annotations, life scientists currently have the choice between two different procedures. They either use one of the databases and may have to accept a serious loss of accuracy, or they manually compare different annotations. By selecting one data source, the result depends, among other factors, on the update cycle of the annotation host. Especially for the construction of metabolic models, the accuracy of the model strongly depends on the quality of the primary resources and the gene function prediction [19]. Even one missing enzyme function can be highly critical, because it might have a great impact on the whole model. As stated by Schnoes et al., the annotation errors in public databases are a problem that should not be underestimated, since these errors are propagated over time [20]. In the manual evaluation of discrepancies between the sources, the scientist has no clear criteria for decision. In order to solve this problem and to give the scientist a fast overview, specialised tools that annotate, integrate, and mine the available information are necessary [19].
For this purpose, the program EnzymeDetector was created. It includes a reasonable and comprehensible scoring scheme, and combines the information of the major databases, a frequently updated BLAST-based annotation, and a sequence pattern search. It provides a fast overview of the possible annotations for each gene and additional help to distinguish between their qualities. The advantage over previously described tools is that the scientist does not have to manually analyse single sequences, but has the data for the whole genome pre-calculated in a database. Furthermore, the database is easily accessible and can be downloaded. Although background knowledge of functional annotation is very helpful, EnzymeDetector can also be used by life scientists who are not familiar with bioinformatics.

Construction and content
An overview of the different parts of the EnzymeDetector program is shown in Figure 1.
We used nine different prokaryotic genomes as training data to determine optimal thresholds and default values. The statistics shown in this manuscript were calculated for those organisms as well. The nine organisms are Corynebacterium glutamicum ATCC13032, Dinoroseobacter shibae DFL12, Escherichia coli K12 MG1655,
Figure 1 Scheme of the work flow of EnzymeDetector. The different annotations, shown in green, are either obtained from biological databases or obtained by a BLAST-based function prediction. As additional information sources BRENDA, AMENDA and BrEPS were included, shown in yellow. All information together forms the enzyme function data pool, shown in red. This data pool is stored in a SQL database, but can be accessed via a web interface as well.

Data collection
As a first step, the program collects and stores enzyme function annotations from different databases. Currently, the program uses data from NCBI, KEGG, PEDANT, a database specialised on Sulfolobus [21], the Pseudomonas Genome Database V2, and the annotation data found in Swiss-Prot [7]. The annotation of other databases can easily be added by including a respective parser.
Additional annotation via a self-performed BLAST search against UniProtKB
As a second step, the program performs a BLAST analysis using all protein sequences of the organism as input sequences. Version 2.2.25 of the NCBI BLAST algorithm [22] is used. The search is performed against all protein sequences of the UniProt database [7]. The resulting hits are automatically evaluated, yielding the BLAST-based annotation.
Three criteria were taken into account for the evaluation of the BLAST hits:
- The completeness of the Enzyme Commission numbers (EC numbers): Incomplete EC numbers are ignored if other hits with complete EC numbers exist for the respective gene, because the necessary information on substrate specificity is not contained in incomplete EC numbers.
- The expectation value (E-value): For a conclusive function annotation the best BLAST hit has to have an E-value more than thirty orders of magnitude smaller than the E-value of the next best hit. If there are several hits with an E-value within thirty orders of magnitude of the overall best hit, all of those hits are marked as candidates. Subsequently, these hits are assumed to be within the 'relevance range'. The value of thirty orders of magnitude was based on an evaluation of all BLAST hits of the nine organisms used as training data against the Swiss-Prot annotation. With the chosen value an optimal prediction was reached. About 99% of the enzymes annotated in Swiss-Prot were predicted in this way with only 7% of false positives (additional enzymatic activities for enzymes with a given EC number in Swiss-Prot).
- The occurrence of the EC numbers: A cut-off value of 5 for the number of homologous sequences was chosen. If a certain EC number occurred more than 5 times in the list of all BLAST hits, it was considered to be relevant. This way, the inclusion of hits based on incorrectly annotated sequences is less likely. We chose a cut-off value of only 5 in order to prevent the loss of valuable information. With a manual analysis of the results of some BLAST searches, we found that with a higher cut-off value important information was lost. This information often proved to be crucial for model developers.
For every gene, all EC numbers are stored that are complete, lie within the 'relevance range', and occur a relevant number of times. If only hits with a low frequency were found, they were nevertheless accepted, so that new results were not rejected.
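The three criteria can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions and not the code used by EnzymeDetector: the names `is_complete` and `evaluate_hits`, the representation of hits as (EC number, E-value) pairs, the treatment of an EC number as complete when all four fields are numeric, and the floor substituted for E-values reported as 0 are hypothetical choices made for this example.

```python
import math
from collections import Counter

# Assumption: BLAST may report an E-value of 0; a small floor keeps log10 defined.
E_VALUE_FLOOR = 1e-300


def is_complete(ec):
    """True if all four EC fields are numeric, e.g. '4.2.3.1' but not '4.2.3.-'."""
    fields = ec.split(".")
    return len(fields) == 4 and all(f.isdigit() for f in fields)


def evaluate_hits(hits, orders=30.0, min_occurrence=5):
    """Return the candidate EC numbers for one gene.

    hits: list of (ec_number, e_value) pairs, one pair per BLAST hit.
    """
    if not hits:
        return set()

    # Criterion 1: ignore incomplete EC numbers if complete ones exist.
    complete = [(ec, e) for ec, e in hits if is_complete(ec)]
    pool = complete if complete else list(hits)

    # Criterion 2: keep hits within `orders` orders of magnitude of the best hit.
    log_e = [(ec, math.log10(max(e, E_VALUE_FLOOR))) for ec, e in pool]
    best = min(value for _, value in log_e)
    in_range = {ec for ec, value in log_e if value - best <= orders}

    # Criterion 3: prefer EC numbers that occur more than `min_occurrence` times.
    counts = Counter(ec for ec, _ in pool)
    frequent = {ec for ec in in_range if counts[ec] > min_occurrence}

    # If only low-frequency hits exist, they are accepted nevertheless.
    return frequent if frequent else in_range
```

In this sketch a gene with six hits for 4.2.3.1 at 10^-150 and a single hit for 3.1.1.1 at 10^-130 yields only 4.2.3.1, whereas a gene with nothing but single hits keeps all hits inside the relevance range.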

Searching BRENDA and AMENDA
Specific experimental enzyme information from the enzyme database BRENDA and from AMENDA [23] is added. The information in BRENDA is hand-curated and has a very high reliability. However, the information is not connected to a specific enzyme sequence if that sequence is not given in the original paper. This has to be considered when analysing the EnzymeDetector result tables, which contain gene-enzyme combinations. When only a BRENDA/AMENDA annotation was found without gene information, the result was marked as 'not sequence related'.

Pattern search
The program BrEPS [24] performs a pattern-related enzyme annotation based on consensus sequence patterns. To analyse an organism, its protein sequences were searched against the pattern database, and the results were stored as additional information in the EnzymeDetector result database.

Swiss-Prot
In the UniProtKB database of UniProt an ID mapping data file is stored. This file contains links between UniProt enzyme information and genes of different organisms. The information of the analysed organism was obtained and stored in the EnzymeDetector result database. Only information of the manually curated Swiss-Prot part of the database is used.

Building the result database
The results of the procedure are stored in a relational database using MySQL, containing a combination of all collected and computed data. For each gene-enzyme combination found by the BLAST search or present in one of the databases, an entry was created. For all entries three types of information are available:
- Gene-related information: the gene identifier from NCBI (GI), the gene position, and the source organism
- Enzyme-related information: the EC number and the globally accepted name as defined by the IUBMB biochemical nomenclature committee
- Evaluation-related information: the E-value of the best BLAST hit of the enzyme, the position of the hit, the number of enzymes that are suggested for the gene by the BLAST evaluation program, information on the number of databases that predicted the particular enzyme, and whether the enzyme is confirmed by the pattern-search program BrEPS.
A default scoring scheme was constructed for the weights of the different data sources based on a comparison with the manual Swiss-Prot annotation for the respective gene (as far as this was available). Precision (= 100 * true positives/(true positives + false positives)) and recall (= 100 * true positives/(true positives + false negatives)) of the sources were calculated. The default values for the sources were calculated based on the average F1-scores (= 2 * (precision * recall)/(precision + recall)). We set the relevance scores of the different sources in relation to the relevance score of the BLAST-based annotation. For an F1-score of 100% a relevance score of 13 is assigned, for an F1-score between 95 and 100% a score of 12, and for any other value the relevance score drops by one for each drop of the F1-score by 5%, leading to values of zero for F1-scores < 40%. These values were chosen relative to the top score of 8 for the BLAST-based annotation. This is a constraint arising from the classification of the BLAST results in 8 different groups. The other scores were defined dependent on that.
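The formulas and the mapping from F1-score to relevance score can be written compactly. The following Python sketch assumes percentages as input; the function names are hypothetical, and the closed form floor(F1/5) - 7 is our reading of the stepwise rule described above (13 at 100%, 12 between 95 and 100%, one point less per 5% step, zero below 40%).

```python
import math


def precision(tp, fp):
    """Precision in percent; tp + fp must be greater than zero."""
    return 100.0 * tp / (tp + fp)


def recall(tp, fn):
    """Recall in percent; tp + fn must be greater than zero."""
    return 100.0 * tp / (tp + fn)


def f1_score(p, r):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2.0 * (p * r) / (p + r)


def relevance_from_f1(f1_percent):
    """Map an F1-score in percent to a relevance score between 0 and 13."""
    return max(0, math.floor(f1_percent / 5.0) - 7)
```

For example, a source with 90 true positives, 10 false positives and 30 false negatives has a precision of 90%, a recall of 75%, an F1-score of about 81.8%, and therefore a relevance score of 9 under this reading.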
In Figure 2 the F1-scores of the different sources are shown. Additionally, the score of the combined information is shown in black, which is considerably higher than the score of any single source. Only the pattern-based BrEPS annotation has higher values for some of the organisms, but gives predictions only for 12% of the gene products annotated as enzymes. In some cases the combined result of EnzymeDetector shows a lower agreement with the Swiss-Prot annotations than BrEPS, because the BrEPS prediction can be overruled by the combined result of several other annotation sources.
According to the grouping of the F1-scores and the average F1 of the different databases, KEGG and PEDANT were assigned a default value of 7, and PEDANT and NCBI default values of 6.
For the BLAST-based method, a top score of 8 was determined according to the average F1 value. This score consists of two parts: on the one hand the score for the best E-value of the annotation found in general, i.e. in the whole UniProt database with TrEMBL included, and on the other hand the score of the best E-value found in the reviewed Swiss-Prot part. The overall score for the BLAST-based annotation is the sum of these two scores. Each individual score is obtained by classifying the E-value into one of four groups: Annotations with an E-value greater than 10^-40 were assigned a score of 1. Those with E-values in the range from 10^-40 to 10^-80 were assigned a score of 2. For E-values ranging from 10^-80 to 10^-120 a score of 3 was assigned, and for E-values smaller than 10^-120 a score of 4.
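The two-part BLAST score can be sketched as follows. The function names are hypothetical, and two points are assumptions of this sketch rather than statements from the text: how E-values lying exactly on a class boundary are assigned, and that a missing Swiss-Prot hit contributes a score of 0.

```python
def evalue_class(e_value):
    """Classify one E-value into the four groups (scores 1 to 4).

    Assumption: a value exactly on a boundary falls into the better class,
    except at 1e-120, where only strictly smaller values score 4.
    """
    if e_value > 1e-40:
        return 1
    if e_value > 1e-80:
        return 2
    if e_value >= 1e-120:
        return 3
    return 4


def blast_score(best_uniprot_evalue, best_swissprot_evalue=None):
    """Sum of the class scores for the best UniProt and best Swiss-Prot hit.

    Assumption: if no Swiss-Prot hit exists, that part contributes 0.
    """
    score = evalue_class(best_uniprot_evalue)
    if best_swissprot_evalue is not None:
        score += evalue_class(best_swissprot_evalue)
    return score
```

With two excellent hits (both E-values below 10^-120) the sketch returns the top score of 8; with two weak hits (both above 10^-40) it returns 2.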
For the BrEPS evaluation a top score between 1 and 10 was assigned depending on the quality measure calculated from the program BrEPS.
For hand-curated data (e.g. Swiss-Prot and BRENDA) we assigned a score of 50. This value was chosen because it is considerably higher than the sum of the values of all other sources. This means that the hand-curated data cannot be overruled by other sources in the comparison process.
A score of 25 was assigned to AMENDA. Although the information in AMENDA has a high reliability, it is based on a text-mining process. Thus, the data is not as certain as hand-curated data.
Swiss-Prot was chosen as the standard of truth because it has a large number of manually curated function assignments over a wide range of organisms. In all probability the different sources synchronise their annotation data with Swiss-Prot at regular intervals. Therefore, the F1-score of the annotation predictions for those genes where no Swiss-Prot entry is available is most likely not as high as for the genes we analysed. Lacking an alternative for the determination of the ranking of the sources, we had to rely on the F1-scores determined against Swiss-Prot. It should be noted that, because the BLAST-based annotation is performed against UniProt and the query sequence is not excluded from the search results, the Swiss-Prot annotations are included in the evaluated results. This is balanced by the fact that we use not only the E-values as a decision criterion but also the number of occurrences of an EC number among the BLAST hits. Thus, even if the query sequence is found with a very good E-value, it will only be considered as a candidate if other sequences with that annotation match the search sequence as well.
The sum of all relevance values defines the overall relevance of a result entry; this overall relevance score marks the quality of the annotation.
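As a minimal sketch of this summation, the following Python function adds the weights of all sources that support a gene-enzyme combination to the BLAST and BrEPS scores. The function name, the dictionary layout, and the calling convention are hypothetical; the weights shown are the default values stated in the text for Swiss-Prot, BRENDA, AMENDA, KEGG and NCBI, and the table lists only a subset of the sources, so further sources can be supplied by the caller.

```python
# Default weights taken from the text; other sources can be added by the caller.
DEFAULT_WEIGHTS = {
    "Swiss-Prot": 50,
    "BRENDA": 50,
    "AMENDA": 25,
    "KEGG": 7,
    "NCBI": 6,
}


def overall_relevance(sources, blast_score=0, breps_score=0, weights=None):
    """Sum the weights of all supporting sources plus BLAST and BrEPS scores.

    sources: names of the databases that contain the gene-enzyme combination.
    blast_score: 0 to 8, breps_score: 0 to 10 (0 if the method gave no result).
    """
    if weights is None:
        weights = DEFAULT_WEIGHTS
    return sum(weights[name] for name in sources) + blast_score + breps_score
```

An entry found in KEGG and NCBI with a BLAST score of 8 reaches 21 in this sketch, while any entry confirmed by Swiss-Prot starts at 50 and thus exceeds every combination of the non-curated sources listed here.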

Evaluation of function predictions
The following statistics were calculated for the nine organisms mentioned above. For the analysed organisms an enzyme function was predicted for 30% of their genes on average (Table 1), using annotations that had an overall relevance score of at least 7.
This enzyme content matches the generally accepted value. As a reference value we took the Escherichia coli enzyme content of 35% as given by Swiss-Prot. We took E. coli as reference because it is one of the best-analysed organisms.
In only 29% of all annotations, the three main annotation databases predicted identical enzyme functions. For another 14% there was agreement between two of the three sources, and for 30% of all annotated genes only one of the three databases contained a function assignment at all (Figure 3). On average 19% of all genes with a predicted enzyme function were only annotated by the BLAST-based annotation and not in any of the main annotation databases. For the BLAST-based annotation, only hits with an E-value lower than 10^-80 were considered. The additional BLAST results can be explained by the fact that the annotation of the other annotation sources may be based on earlier UniProt versions, or that different assignment criteria were used. The different annotation sites provide no information on the time period between updates of their annotations.
On average 13% of all additional annotations added by the BLAST-based annotation had a low E-value between 10^-50 and 10^-120 (Figure 4). A further 5% had a very low E-value of <10^-120. The 21% of annotations with E-values between 10^-20 and 10^-50 represent promising candidates if an enzyme function is missing for the construction of a metabolic model. 61% of the annotations have an E-value higher than 10^-20. These hits get a low relevance score and are thereby excluded if an adequate cut-off is chosen. As expected, the function predictions for the hyperthermophilic archaeon Sulfolobus solfataricus had a lower average quality than for the analysed bacteria, reflecting the small number of reliable enzyme sequences of Archaea and the highly specialised metabolism. Therefore, the BLAST hits displayed much higher average E-values. We grouped the overall relevance of the EnzymeDetector results in four categories (Figure 5). We created these groups according to the three different cut-offs we suggest further down. For every gene only the best candidate was considered for this evaluation.
1. Annotations with an overall-relevance smaller than 7 (i.e. beneath the minimal cut-off we suggest) are shown in red. An average of 64% of all genes belong to that group, resulting mainly from BLAST hits with an intermediate E-value.
2. Qualitatively good annotations with an overall-relevance between 7 and 25 are shown in orange. 24% of the results can be found in this group. If an annotation has an overall score in the lower range of this group, it was only found in one of the annotation sources and therefore might have to be checked by the scientist.
3. Annotations with a very good confidence are shown in yellow. Their overall-relevance is between 26 and 38. Those hits have a perfect recall and a precision of over 95%. 3% of the results belong to this group.
4. Annotations in the top-scoring group have an overall-relevance greater than 38. This group is shown in green. On average, 6% of the results belong to that group.
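The grouping can be reproduced with a few lines of Python. The sketch below is illustrative only: the function names and the (gene, overall relevance) input format are hypothetical, and the groups are labelled by the colours used in Figure 5. As in the evaluation described above, only the best candidate per gene is counted.

```python
from collections import Counter


def relevance_group(overall):
    """Return the colour of the relevance group for an overall relevance."""
    if overall < 7:
        return "red"      # beneath the minimal cut-off
    if overall <= 25:
        return "orange"   # good confidence
    if overall <= 38:
        return "yellow"   # very good confidence
    return "green"        # top-scoring group


def group_distribution(entries):
    """Count genes per group, using only the best candidate of each gene.

    entries: iterable of (gene_id, overall_relevance) pairs.
    """
    best = {}
    for gene, score in entries:
        best[gene] = max(score, best.get(gene, score))
    return Counter(relevance_group(score) for score in best.values())
```

For a gene with two candidates scoring 5 and 30, only the candidate with 30 is counted, so the gene is placed in the yellow group.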
As expected the results for E. coli have the highest relevance scores. This is due to the fact that it is an experimentally very well-analysed organism with reliable annotations in the input databases, which yields high overall relevance.
If a gene annotation was found by the BLAST-based annotation and in at least one of the other sources, the prediction was identical in most of the cases (Figure 6). As an example, the PEDANT and the BLAST-based annotations were identical in 51% of all cases, and in another 42% of the annotations, non-conflicting evidence was obtained (for example, the gene b0004 of E. coli K12 had an enzyme function of 4.2.3.1 in the
Figure 5 EnzymeDetector relevance groups. The overall relevances of all evaluated organisms and the mean are shown merged into four groups according to the different cut-off suggestions we make. Only the best candidate for every gene was considered. Genes beneath the minimal cut-off of 7 are shown in red (overall-relevance 1-6). Genes with a good confidence level are shown in orange (overall-relevance 7-25), genes with a very good confidence level in yellow (overall-relevance 26-38). Annotations of the top scoring group with the highest confidence are shown in green (overall-relevance >38).

Utility
The EnzymeDetector website holds a database containing the described combined enzyme annotations. This database will be updated twice a year to keep the data up-to-date. The results are presented via a web interface, which allows the user to interactively explore, process, and download the data. In the current version, all prokaryotic organisms are included in the database, with the genome annotations from NCBI, KEGG, PEDANT, and Swiss-Prot, and the BRENDA and AMENDA data included. The BLAST-based annotation is added continuously (limited by available computer time). This may lead to the fact that no E-value information is provided for some organisms, and that for those organisms the highest reachable overall-relevance is smaller compared to those with a BLAST-based annotation.
An interactive help is displayed by selection of the help sign in the lower right corner of every site. Subsequently, a help or explanation window opens when the cursor is pointed at any object.
The organism can be selected by the user on the start page of the web interface. After this selection, the annotation sources currently available for that organism are displayed. Annotation sources to be included in the analysis can be selected. The default relevance scores for those sources are given and can be modified.
Additionally, the user can select default cut-off values for the extraction of the data from the result pool. We suggest three different cut-offs depending on the quality of data the user wants to achieve. The recommended cut-off scheme is based on Figure 7. The cut-offs were defined by evaluation of the results of 81 analysed organisms (excluding the nine organisms representing the training data) against the corresponding Swiss-Prot annotations (list of organisms can be found in additional file 1).
- For generous filtering we suggest a cut-off of 7. With this value the retrieved data has optimal recall, but a low precision. With this setting genes that are only annotated by the BLAST-based annotation (with a quality score of 7 and higher) are not lost.
- For medium filtering we suggest a cut-off of 26. This is the lowest relevance score for which the average F1 is greater than 99%.
- If maximum precision is wanted we suggest a cut-off of 39. This is the lowest relevance for which the F1 is maximal.
By default the cut-off value for the overall-relevance is set to the generous filtering option on the web interface. This can be changed by the user at any time.
The cut-off for the maximal E-value is set to 10^-25. This cut-off only affects the data of the BLAST-based annotation. Only results with an E-value below the chosen cut-off are integrated in the BLAST-based annotation.
Both cut-off values can be changed at any time of the analysis.
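How the two cut-offs interact can be sketched as follows. This is one possible reading of the description, not the implementation behind the web interface: the `Entry` class, its field names, the preset labels, and the decision to drop the BLAST contribution from the overall relevance when the E-value is not below the cut-off are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Suggested overall-relevance cut-offs from the text; the labels are hypothetical.
CUTOFFS = {"generous": 7, "medium": 26, "strict": 39}


@dataclass
class Entry:
    gene: str
    ec: str
    db_relevance: int               # summed weights of all non-BLAST sources
    blast_score: int = 0            # 0 to 8
    evalue: Optional[float] = None  # best E-value, None if no BLAST result


def filter_entries(entries, min_relevance=CUTOFFS["generous"], max_evalue=1e-25):
    """Return (entry, overall relevance) pairs that pass both cut-offs.

    Assumption: the E-value cut-off only removes the BLAST contribution;
    the entry itself survives if the remaining sources score high enough.
    """
    kept = []
    for entry in entries:
        blast_ok = entry.evalue is not None and entry.evalue < max_evalue
        overall = entry.db_relevance + (entry.blast_score if blast_ok else 0)
        if overall >= min_relevance:
            kept.append((entry, overall))
    return kept
```

Under these assumptions an entry supported only by a BLAST hit with an E-value of 10^-10 is removed at the default settings, whereas an entry with a database relevance of 7 and no BLAST result is retained by the generous filter.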
On the web interface the user has the choice between four different views on the data:
The tabular view (Figure 8)
By default, all columns are sorted by gene identifier. The user can sort the entries by EC number or accepted name by clicking on the respective column headers. It is possible to search the result table for a certain entry by using the search mask. The possible search fields are GI, gene position, EC number, and recommended name. Additionally, it is possible to filter the results for data source occurrence.
The cut-off values that are used for filtering the displayed data can be adapted at any time. If just one candidate for a gene within the selected constraints is available, the entry is automatically selected. If there are conflicting EC annotations, the user has to decide which annotation or annotations to select.
The selected subset of data or the whole dataset can be downloaded as a CSV file for further processing.
The statistics view (Figure 9)
By clicking on the corresponding tab, the user can switch to the statistics. The page is split into two parts: the static and the dynamic view. For the static view the whole dataset with default constraints is used. The dynamic view presents basically the same information, but the computation considers only those data entries that fulfil the user-chosen constraints. The selectable constraints are the minimal overall-relevance and the maximal E-value. Additionally, the user has the possibility to compare up to three of the annotation sources to obtain their degree of consistency.
Figure 7 Average precision, recall and F1-score for the different overall relevances. The average precision is shown in yellow, the average recall is shown in red, and the F1-score is shown in orange. The values were calculated for 81 different organisms excluding the nine organisms of the training set.
Figure 8 Tabular view of the web interface. In this case the results for Dinoroseobacter shibae strain DFL12 are presented. In the table a summary of our own result database is shown. For every gene-enzyme combination, a new data row is created with information about the gene (positions, GI), with information on the found annotation (recommended name, EC number, best E-value of the found annotation) and with information on the quality of the annotation (relevances of the input databases, overall relevance).
The annotation comparison view (Figure 10)
In this view, the user has the possibility to compare the enzyme stock of the explored organism to that of one or two other organisms. All enzymes of the explored organisms are displayed together with their best E-value and their best overall relevance. All data sets can be downloaded.
The pathway view (Figure 11)
The pathway view shows a list of all pathways from MetaCyc [25] and KEGG. The total number of enzymes in the pathway and the number of found enzymes are displayed. The enzymes that are missing are given as well. By default the table is sorted by pathway name, but it can also be sorted by the source or the coverage.

Outlook
A user upload field is planned. With it, users will be able to upload their own annotation of the provided organism (in a defined format). This information will be integrated in the result of the web interface.

Discussion
The evaluation of the EnzymeDetector results clearly shows that reliance on only one annotation source causes a loss of valuable information. In only about one third of the cases do the big annotation hosts agree in their annotation. 19% of the annotations found by the EnzymeDetector were found only by the performed BLAST search.
The results of the EnzymeDetector help the user to find all information available for a genome and to distinguish between the qualities of the annotations. The provided data of the web interface can be used by life scientists for obtaining information on a selected organism or gene of interest. Furthermore, the tool is helpful for developers of metabolic models, providing more reliable information on the enzymes present in defined organisms.

Conclusions
For the detailed analysis of the metabolism of an organism, it is essential to have an accurate annotation of enzyme functions. Given that there are inconsistencies and errors in the existing databases, it is not recommended to rely on only one source. Hence, it is beneficial to integrate and compare the existing genome annotations of different sources. However, it is extremely time-consuming, if not impossible, to manually integrate all existing function predictions. Therefore, we provide the tool EnzymeDetector, which gives a fast and up-to-date overview of the available annotation data from a selected set of sources. In addition, it ranks the information by quality. The results are accessible via a web interface. Thus, it is easy for model developers or lab scientists to gain information about a gene of interest or the whole enzyme stock of an organism. It is possible to assign a personal scoring scheme to the different annotation sources. This way a customised data set can be created. All information is downloadable in CSV format. Hence, the user can easily perform a detailed analysis with the data. An option will be added that allows the user to upload data from other sources in a predefined format. This will facilitate the integration of organism-specific databases, which improves the overall results.
Figure 10 Display of the annotation comparison view of the web interface. Here calculated for Escherichia coli, Corynebacterium glutamicum and Yersinia pseudotuberculosis. The enzyme stock of the calculated organism is shown in comparison to up to two other organisms. For each organism the best overall-relevance and the best E-value is shown.
Figure 11 Display of the pathway view of the web interface. Here shown for Escherichia coli. The pathway names are shown together with their source (KEGG or MetaCyc), their coverage and the enzymes that are missing.
Quester and Schomburg BMC Bioinformatics 2011, 12:376 http://www.biomedcentral.com/1471-2105/12/376
Because the program performs a BLAST search, the EnzymeDetector approach shows better results for well-curated genomes like Escherichia coli. Function assignment to genes based on that search is more significant for genes that have similarities to many known sequences.
The thresholds suggested in this paper are based on the analysis of nine organisms. These values will be regularly updated with analysis of the information of more organisms. Thus the threshold values will get more accurate or rather more adaptive to all organisms over time.
Currently EnzymeDetector results are only available for prokaryotes. The integration of eukaryotes is planned in the future.