- Research article
- Open Access
DomSign: a top-down annotation pipeline to enlarge enzyme space in the protein universe
© Wang et al.; licensee BioMed Central. 2015
- Received: 7 October 2014
- Accepted: 18 February 2015
- Published: 21 March 2015
Computational predictions of catalytic function are vital for in-depth understanding of enzymes. Because several novel approaches performing better than the common BLAST tool are rarely applied in research, we hypothesized that there is a large gap between the number of known annotated enzymes and the actual number in the protein universe, which significantly limits our ability to extract additional biologically relevant functional information from the available sequencing data. To reliably expand the enzyme space, we developed DomSign, a highly accurate domain signature–based enzyme functional prediction tool to assign Enzyme Commission (EC) digits.
DomSign is a top-down prediction engine that yields results comparable, or superior, to those from many benchmark EC number prediction tools, including BLASTP, when a homolog with an identity >30% is not available in the database. Performance tests showed that DomSign is a highly reliable enzyme EC number annotation tool. After multiple tests, the accuracy is thought to be greater than 90%. Thus, DomSign can be applied to large-scale datasets, with the goal of expanding the enzyme space with high fidelity. Using DomSign, we successfully increased the percentage of EC-tagged enzymes from 12% to 30% in UniProt-TrEMBL. In the Kyoto Encyclopedia of Genes and Genomes bacterial database, the percentage of EC-tagged enzymes for each bacterial genome could be increased from 26.0% to 33.2% on average. Metagenomic mining was also efficient, as exemplified by the application of DomSign to the Human Microbiome Project dataset, recovering nearly one million new EC-labeled enzymes.
Our results offer preliminary confirmation of the existence of the hypothesized huge number of “hidden enzymes” in the protein universe, the identification of which could substantially further our understanding of the metabolisms of diverse organisms and also facilitate bioengineering by providing a richer enzyme resource. Furthermore, our results highlight the necessity of using more advanced computational tools than BLAST in protein database annotations to extract additional biologically relevant functional information from the available biological sequences.
- Enzyme mining
- Protein functional annotation
- Machine learning
- Top-down algorithm
Of the known biological sequences in the post-genomic era, the vast majority have not yet been, and cannot be, characterized by experimentation or manual annotation. For example, Swiss-Prot, a protein database with manually curated functional annotation, had only 547,085 entries as of December 2014, whereas a comprehensive protein database such as UniProt-TrEMBL, which contains high-quality computationally analyzed functional annotations and covers most of the known protein sequences, contains tens of millions of members. Therefore, automated annotation is necessary to assign functions to uncharacterized sequences. Enzymes are of special importance owing to their central roles in metabolism and their potential uses in biotechnology. Hence, a greater ability to predict enzyme functions will not only give biologists deeper insight into metabolism in general but also enlarge the toolkit available to bioengineers.
Many novel bioinformatics tools with different bases, such as protein structure, functional clustering, evolutionary relationships and biological systems networks, have been developed for enzyme or protein functional annotations. Many of them perform better [7,8] than conventional approaches like BLAST, which is based on pairwise comparisons of gene sequence similarities to assign functions to new genes. However, BLAST is currently the main approach used in functional annotations, whereas many recently developed tools are rarely applied in research projects. Additionally, BLAST-based functional annotations perform poorly when only distantly related homologs with similarities of <30% can be found [11,12]. Furthermore, many proteins recently discovered using metagenomics approaches do not have homologs with high enough amino acid sequence identity levels for reliable functional annotation. For example, in a benchmark metagenomics study focusing on cow rumen–derived biomass-degrading enzymes, only 12% of the 27,755 assembled carbohydrate-active genes had >75% amino acid sequence identity to genes deposited in NCBI-nr, whereas 43% of the genes had <50% identity to any known protein in NCBI-nr, NCBI-env and CAZy. Thus, if novel and combinatorial approaches are used, to what extent, with acceptable precision, can we improve the coverage of protein annotation? For enzymes, there is a well-established system, the Enzyme Commission (EC) number, which describes catalytic functions hierarchically using four digits. As far as we know, although many EC number prediction tools are available, most are limited to performance tests within small datasets, and none of them has been used to systematically address the comprehensiveness of enzyme functional annotation in public protein databases.
Thus, a more specific question, “To what extent can we improve, with an acceptable precision, the coverage of enzyme annotations using EC numbers?” is worth addressing by illustrating the power of approaches whose utility goes beyond BLAST. The insight we obtain can also be generalized to protein annotations for other functional attributes.
Novel approaches with high coverage rates that maintain an acceptable precision are therefore of special interest. Hierarchical or top-down algorithms with a layer-by-layer logic satisfy these requirements [15,16]. Such approaches assign functions only at a level that can be inferred with high confidence. Hence, in many cases, general rather than specific functions (for example, the top level of EC numbers) are assigned to avoid the overprediction of protein functions, such as annotation below the trusted cutoff or inference only from a superfamily, which is a main problem of current database annotations. Furthermore, owing to their hierarchical structure, this approach is suitable for widely accepted protein function definition systems such as EC or Gene Ontology (GO), both of which are widely applied to consistently describe the functions of gene products.
Domains are conserved parts of a given protein’s amino acid sequence and structure that can evolve, function and exist independently of the rest of the amino acid chain. Thus, it has been hypothesized that machine learning with domains as input labels might serve as a powerful approach to predict protein functions. For example, the dcGO database, based on associating SCOP domains or domain combinations with GO terms of protein products, infers the domain or domain combinations responsible for particular GO terms. A domain architecture–based approach might thus be a powerful tool for predicting enzymatic functions. Here, we report on “DomSign”, a top-down enzyme function (EC number) annotation pipeline based on domain signature–derived machine learning. We must emphasize, based on the belief that any reliable protein function prediction tool should depend on multiplicity, that our purpose here is not just to present a simple function prediction tool but rather to address the extent to which the coverage of enzyme annotations by EC numbers can be improved, with acceptable accuracy, by methods beyond simple BLAST.
To test the reliability of DomSign, we compared it with many benchmark enzyme annotation methods. The performance of DomSign was comparable, or superior, to all of them after exhaustive testing against reliable datasets, such as Swiss-Prot enzymes, suggesting that DomSign is a highly reliable enzyme annotation tool that can identify more enzymes in the protein universe. Furthermore, to expand the number of enzymes retrieved from large datasets, we compared our results with those proteins already assigned EC numbers in the original dataset. More ‘hidden enzymes’ were predicted by DomSign. Thus, DomSign, with >90% accuracy suggested by the tests, can be used to predict a large number of enzymes by assigning EC numbers to proteins in both UniProt-TrEMBL and the Kyoto Encyclopedia of Genes and Genomes (KEGG) bacterial subsection, which, respectively, represent the most complete protein database and the best metabolic pathway information collection. DomSign can also be applied to metagenomic samples, as exemplified by the Human Microbiome Project (HMP) dataset, a comprehensive and well-analyzed metagenomic gene dataset focused on parsing the interactions between commensal microorganisms of humans (the human microbiome) and human health. In this case, DomSign not only significantly increased the number of EC-labeled enzymes but also helped to clarify the metabolic capacity of the sample by recovering new EC numbers beyond the official annotation. These results highlight the necessity to develop enzyme EC number prediction projects or, more generally, protein annotation projects with novel approaches akin to DomSign to extract more biological information from the available sequencing data.
Definition of a domain signature
Pfam is a protein domain collection with ~80% coverage of the current protein universe, and its Pfam-A subsection is highly reliable owing to its manually curated seed alignment. For our purpose, a string of non-duplicated Pfam-A domains belonging to a protein was defined as its domain signature (DS) and used to predict function(s). Although some research has suggested a potential advantage of involving domain recurrence and order in protein GO assignments, our results showed that this simpler DS definition provided a higher coverage for proteins identified in metagenomics studies. When utilizing Swiss-Prot protein DSs to retrieve HMP phase I non-redundant proteins, the coverage was 74.7% when considering domain recurrence and order versus 77.1% with the simpler definition. Unlike the GO term assignment used previously, recurrence did not lead to a significant difference in coverage, as indicated by reconstructing the EC number machine-learning prediction model (Additional file 1) used in this work, whose method is presented in the following part. Thus, because the main aim of our study was to improve enzyme annotation coverage, our simpler DS definition was applied.
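The two DS definitions compared above can be illustrated with a minimal Python sketch. The function name `domain_signature`, the example Pfam accessions and the use of a sorted tuple as the canonical form are assumptions made for illustration only; the text specifies only that the simpler definition ignores domain recurrence and order.

```python
def domain_signature(domains, keep_order_and_recurrence=False):
    """Return a hashable domain signature (DS) for one protein.

    domains: Pfam-A accessions listed from N- to C-terminus.
    With keep_order_and_recurrence=False, duplicates and order are
    discarded (the simpler definition); sorting is only an assumed way
    to obtain one canonical representation of the domain set.
    """
    if keep_order_and_recurrence:
        return tuple(domains)
    return tuple(sorted(set(domains)))


# Two hypothetical proteins with the same domains in different arrangements.
protein_a = ["PF07714", "PF00069", "PF00069"]
protein_b = ["PF00069", "PF07714"]

simple_a = domain_signature(protein_a)
simple_b = domain_signature(protein_b)
strict_a = domain_signature(protein_a, keep_order_and_recurrence=True)
strict_b = domain_signature(protein_b, keep_order_and_recurrence=True)
```

Under the simpler definition both proteins share one DS, whereas the order- and recurrence-aware definition separates them, which is consistent with the simpler definition retrieving more query proteins.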
Preparation of the dataset
Swiss-Prot and TrEMBL datasets were downloaded on November 2, 2013, from the Pfam ftp site (version 27.0), from which Pfam-A domains were extracted. The Pfam-A Hidden Markov Model file (PfamA.hmm) for hmmsearch (version 3.1b1) was accessed from the same site. The HMP phase I non-redundant protein dataset (95% identity cut-off, 15,006,602 entries from 690 samples) was collected from the HMP data processing center (http://www.hmpdacc.org/). A benchmark dataset for unbiased tests was collected from a previously published study (Supplementary Data 2 therein). The files (gene IDs and sequences in the fasta format) from KEGG were downloaded on March 6, 2014. The EC2GO mapping file was downloaded on June 20, 2014, from the GO homepage. All of these files were further processed as stated below.
“Sprot enzyme” dataset
The Swiss-Prot dataset is a protein collection with exhaustive, manually curated (and thus reliable) functional annotation. It was therefore a good choice as the training set for comparing prediction model performance by cross-validation. The subset of enzymes in Swiss-Prot with both single EC numbers and Pfam-A domains was termed “sprot enzyme”, encompassing 228,710 entries and 4,216 distinct DSs. This set was used to construct the “Specific enzyme domain signature” dataset as described below and also as a training dataset to build the prediction model for enzyme mining in several general protein databases (TrEMBL, KEGG and HMP).
“Sprot protein” dataset
Another subset of Swiss-Prot, which contains all of the Pfam-A proteins with single or no EC numbers, was named “sprot protein”, encompassing 46.8% enzymes (with single EC numbers) and 53.2% non-enzymes (without EC numbers), which covers 99.0% of the Swiss-Prot proteins with Pfam-A domains. This dataset was used for model parameter optimization and performance comparisons against BLAST and FS models (see descriptions below in Methods about the FS model).
“Specific enzyme domain signature” dataset
To identify enzymes from the protein pool, we further constructed a “Specific enzyme domain signature” dataset. The fundamental idea was to remove non-enzyme-derived DSs from the 4,216 distinct DSs belonging to “sprot enzyme”. Because EC numbers do not cover all enzymes, however, a more reliable non-enzymatic dataset beyond simple proteins without EC numbers needed to be constructed. Briefly, for the proteins without EC numbers in Swiss-Prot, their annotation raw files (‘KW’, ‘DR’ and ‘DE’ lines) were filtered using a catalytic or functional uncertainty–inferring term (‘iron sulfur’, ‘uncharacterized’, ‘biosynthesis’, ‘ferredoxin’, ‘ase’, ‘enzyme’, ‘hypothetic’, ‘putative’ and ‘predicted’) to reliably extract non-enzymes. By this means, we collected 2,901 unique DSs from 157,240 non-enzymes carrying Pfam-A domains. After removing these DSs from the “sprot enzyme” DS set, 3,949 specific enzyme DSs were acquired, covering 95.4% of “sprot enzyme”. This dataset was used for selecting enzyme candidates from a protein pool using the benchmark comparison method and enzyme mining process.
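The construction described above amounts to a keyword filter followed by a set difference, which can be sketched as follows. This is only an illustrative sketch: the record layout (`ec`, `ds`, `annotation` keys), the case-insensitive substring matching and the toy proteins are assumptions, not details given in the text.

```python
# Terms listed in the text; matching them as case-insensitive substrings
# is an assumption about how the filter was applied.
UNCERTAINTY_TERMS = (
    "iron sulfur", "uncharacterized", "biosynthesis", "ferredoxin",
    "ase", "enzyme", "hypothetic", "putative", "predicted",
)


def is_reliable_non_enzyme(ec_numbers, annotation_text):
    """True if a protein has no EC number and no suspicious annotation term."""
    if ec_numbers:
        return False
    text = annotation_text.lower()
    return not any(term in text for term in UNCERTAINTY_TERMS)


def specific_enzyme_signatures(proteins):
    """Enzyme DSs minus the DSs seen in reliable non-enzymes."""
    enzyme_ds, non_enzyme_ds = set(), set()
    for p in proteins:
        if len(p["ec"]) == 1:
            enzyme_ds.add(p["ds"])
        elif is_reliable_non_enzyme(p["ec"], p["annotation"]):
            non_enzyme_ds.add(p["ds"])
    return enzyme_ds - non_enzyme_ds


# Hypothetical toy records, for illustration only.
toy_proteins = [
    {"ec": ["2.7.11.1"], "ds": ("PF00069",), "annotation": "Protein kinase"},
    {"ec": [], "ds": ("PF00069",), "annotation": "Putative kinase"},
    {"ec": [], "ds": ("PF00010",), "annotation": "Transcription factor"},
    {"ec": ["1.1.1.1"], "ds": ("PF00010",), "annotation": "Alcohol dehydrogenase"},
]
specific_ds = specific_enzyme_signatures(toy_proteins)
```

In the toy input, the second protein is excluded from the non-enzyme set because its annotation is uncertain, so its DS stays enzyme-specific, whereas the DS shared with a reliable non-enzyme is removed.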
“SVMHL unbiased” dataset
To compare the performance of our approach with the SVMHL pipeline (see descriptions below in Methods about the SVMHL model), the aforementioned unbiased dataset was further processed to remove, as described in their paper, enzyme sub-subfamilies with fewer than 50 members.
“TrEMBL enzyme” and “HMP enzyme” datasets
The TrEMBL raw dataset was filtered to extract enzymes with single EC numbers and Pfam-A domains, producing “TrEMBL enzyme”. Likewise, “HMP enzyme” was constructed from the HMP non-redundant protein set. Pfam-A domains were retrieved by an hmmsearch against PfamA.hmm using the cut_tc cutoff with all other parameters set as default. These two datasets were used as the gold standards to test the reliability of the DomSign-based enzyme EC number annotation prior to the actual enzyme mining of TrEMBL and HMP original datasets. The statistics and usage of the datasets constructed in this work are presented in Additional file 2.
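As a rough illustration of how domain hits could be turned into DSs, the sketch below parses HMMER's per-domain tabular output. It assumes the search was written with the `--domtblout` option in addition to `--cut_tc` (the text states only that cut_tc was used), and the function name and example hit lines are hypothetical.

```python
from collections import defaultdict


def signatures_from_domtblout(lines):
    """Build {protein_id: domain signature} from hmmsearch --domtblout lines.

    In hmmsearch output the target is the protein sequence (field 1) and
    the query is the Pfam-A profile, whose accession (field 5) looks like
    'PF00069.20'; the version suffix is dropped here.
    """
    domains = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split()
        protein_id = fields[0]
        pfam_accession = fields[4].split(".")[0]
        domains[protein_id].add(pfam_accession)
    return {pid: tuple(sorted(accs)) for pid, accs in domains.items()}


# Hypothetical hit lines in domtblout layout, for illustration only.
example_lines = [
    "# target name  accession  tlen  query name  accession ...",
    "prot1 - 350 Pkinase PF00069.20 264 1e-50 170.1 0.0 1 2 1e-53 1e-50 169.8 0.0 1 264 10 270 10 270 0.95 Protein kinase domain",
    "prot1 - 350 Pkinase PF00069.20 264 1e-50 170.1 0.0 2 2 1e-20 1e-18 60.2 0.0 1 100 280 340 280 345 0.90 Protein kinase domain",
    "prot1 - 350 SH2 PF00017.19 77 1e-20 70.3 0.0 1 1 1e-23 1e-20 69.9 0.0 1 77 5 9 5 9 0.97 SH2 domain",
    "prot2 - 120 HLH PF00010.21 55 1e-15 55.0 0.0 1 1 1e-18 1e-15 54.7 0.0 1 55 20 75 20 75 0.96 Helix-loop-helix",
]
signatures = signatures_from_domtblout(example_lines)
```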
Prediction model description
Our prediction model consists of two separate steps: enzyme differentiation from the protein pool and EC number annotation based on machine learning. In the first step, proteins in query datasets are recognized as potential enzyme candidates if their DSs are among the aforementioned “Specific enzyme domain signature” set. In the second step, a top-down machine-learning model is developed to predict EC numbers.
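The two steps can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the published implementation: it assumes that "specificity" at a given EC level is the fraction of training enzymes sharing a DS that carry the most common EC prefix at that level, and that the model descends one digit at a time while this fraction stays at or above the threshold. All function and variable names are hypothetical.

```python
from collections import Counter, defaultdict


def train(enzymes):
    """enzymes: iterable of (domain_signature, ec_number), e.g. '3.4.21.4'."""
    model = defaultdict(list)
    for ds, ec in enzymes:
        model[ds].append(tuple(ec.split(".")))
    return model


def predict(ds, model, specific_enzyme_ds, threshold=0.8):
    # Step 1: enzyme/non-enzyme differentiation by DS membership.
    if ds not in specific_enzyme_ds or ds not in model:
        return "Non-enzyme"
    # Step 2: top-down EC assignment, stopping at the first level whose
    # dominant prefix is not specific enough (assumed stopping rule).
    ecs = model[ds]
    accepted = ()
    for level in range(1, 5):
        prefix, count = Counter(ec[:level] for ec in ecs).most_common(1)[0]
        if count / len(ecs) < threshold:
            break
        # For thresholds above 0.5 the dominant prefix at this level
        # necessarily extends the prefix accepted one level higher.
        accepted = prefix
    return ".".join(accepted + ("-",) * (4 - len(accepted)))


# Hypothetical training data, for illustration only.
training = [(("PF00089",), "3.4.21.4")] * 9 + [(("PF00089",), "3.4.21.5")]
training += [(("PF00106",), "1.1.1.1")] * 5 + [(("PF00106",), "2.7.1.1")] * 5
model = train(training)
enzyme_ds = {("PF00089",), ("PF00106",)}
```

With these toy data, `predict(("PF00089",), model, enzyme_ds)` assigns all four digits at the 80% threshold but stops at the third digit at 99%, illustrating how a stricter threshold trades annotation depth for confidence.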
From model to prediction engine
Performance evaluation statistics
Comparison with BLASTP and FS by cross-validation
For the FS model, the script package from Forslund K. et al. was run on our system to calculate the GO terms derived from the DS defined in their work. Subsequently, we used the EC2GO mapping file to convert the FS model’s predicted GO terms to EC numbers. If multiple EC numbers existed for one particular GO term, we assigned that protein all of the relevant associated EC numbers. The three pipelines were tested by 1,000-fold cross-validations of the “sprot protein” dataset. Because the dataset has only enzymes with single EC numbers or non-enzymes, if the FS model predicted more than one EC number for a query then the result was “OP”. Furthermore, to simulate the situation in which no sequences in the database have a high similarity to the query protein, two additional rounds of cross-validations against “sprot protein” were executed. Briefly, sequences in the training set having similarities above threshold I (60% identity, 80% query coverage) and II (30% identity, 80% query coverage) with any query sequence, respectively, were removed by BLASTP. In this way, no sequence in the training set is more similar to any query sequence than the defined threshold. These two rounds of cross-validation, together with the common cross-validation, were termed “identity ≤ 60%”, “identity ≤ 30%” and “identity ≤ 100%”, respectively. For BLASTP, an E-value cutoff of 10^-3 and default parameters were applied. For the FS model, parameters were set as default for the processing.
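The removal of overly similar training sequences can be sketched as a simple filter over BLASTP hits. The tuple layout of the hits, the function name and the handling of the boundary values (identity strictly above the threshold, coverage at or above 80%) are assumptions for illustration.

```python
def filter_training_set(training_ids, blast_hits, max_identity,
                        min_query_coverage=80.0):
    """Drop training sequences that are too similar to any query sequence.

    blast_hits: iterable of (query_id, training_id, percent_identity,
    percent_query_coverage) tuples, e.g. parsed from tabular BLASTP output.
    """
    too_similar = {
        subject
        for _query, subject, pident, qcov in blast_hits
        if pident > max_identity and qcov >= min_query_coverage
    }
    return [seq_id for seq_id in training_ids if seq_id not in too_similar]


# Hypothetical hits, for illustration only.
hits = [
    ("q1", "t1", 95.0, 99.0),  # near-identical homolog
    ("q1", "t2", 45.0, 90.0),  # moderately similar
    ("q2", "t3", 70.0, 50.0),  # high identity but low query coverage
    ("q2", "t4", 25.0, 85.0),  # distant homolog
]
train_60 = filter_training_set(["t1", "t2", "t3", "t4"], hits, max_identity=60.0)
train_30 = filter_training_set(["t1", "t2", "t3", "t4"], hits, max_identity=30.0)
```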
Comparison with the SVMHL model by cross-validation
Because the source code of SVMHL is not available, we compared DomSign with SVMHL by the same test as stated in the SVMHL paper, and the raw data were used for performance comparisons. Briefly, a 10-fold cross-validation was conducted using DomSign on the “SVMHL unbiased dataset”, and prediction accuracy was used to evaluate the results. In this case, accuracy is defined as the percentage of completely correct annotations. Here, one predicted EC number at one specific hierarchy level (an EC number consisting of three digits when the EC hierarchy level is three) is set as ‘correct’ when its component digits are all correct. Because SVMHL does not have an enzyme and non-enzyme differentiation step, we included only the predicted “enzyme” by DomSign in the results comparison, which covered 85.2% ± 0.4% of the query proteins on average during the cross-validation.
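The accuracy definition above (a prediction at level k counts as correct only if all k digits match) can be written as a short Python sketch. The function names are hypothetical, and treating an unassigned digit ("-") as incorrect at that level is an assumption.

```python
def ec_prefix(ec, level):
    """First `level` digits of an EC number such as '3.4.21.4'."""
    return tuple(ec.split(".")[:level])


def accuracy_at_level(true_ecs, predicted_ecs, level):
    """Fraction of predictions whose first `level` digits are all correct."""
    correct = 0
    for true_ec, pred_ec in zip(true_ecs, predicted_ecs):
        pred = ec_prefix(pred_ec, level)
        if "-" not in pred and pred == ec_prefix(true_ec, level):
            correct += 1
    return correct / len(true_ecs)


# Hypothetical annotations, for illustration only.
true_ecs = ["3.4.21.4", "1.1.1.1", "2.7.11.1"]
pred_ecs = ["3.4.21.5", "1.1.1.1", "2.7.-.-"]
acc_level2 = accuracy_at_level(true_ecs, pred_ecs, 2)
acc_level3 = accuracy_at_level(true_ecs, pred_ecs, 3)
acc_level4 = accuracy_at_level(true_ecs, pred_ecs, 4)
```

In this toy case all three predictions are correct at the second level, two at the third and one at the fourth.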
Comparison with EnzML
Here ‘m’ refers to the total number of proteins, and TEi and PEi refer to the sets of annotated EC numbers at the four hierarchical levels, or ‘Non-enzyme’, for each protein.
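The formulas that these symbols belong to are not reproduced in this copy of the text. One plausible reading, offered purely as an assumption, is the standard example-based precision and recall for hierarchical multi-label annotation, in which TEi is the true label set and PEi the predicted label set of protein i: precision = (1/m) Σ |TEi ∩ PEi| / |PEi| and recall = (1/m) Σ |TEi ∩ PEi| / |TEi|. A Python sketch of that reading, with hypothetical function names, is:

```python
def ec_label_set(ec):
    """Expand '3.4.21.4' to {'3', '3.4', '3.4.21', '3.4.21.4'}."""
    if ec == "Non-enzyme":
        return {"Non-enzyme"}
    digits = [d for d in ec.split(".") if d != "-"]
    return {".".join(digits[:k]) for k in range(1, len(digits) + 1)}


def example_based_precision_recall(true_ecs, predicted_ecs):
    """Average per-protein precision and recall over m proteins (assumed form)."""
    m = len(true_ecs)
    precision = recall = 0.0
    for true_ec, pred_ec in zip(true_ecs, predicted_ecs):
        te, pe = ec_label_set(true_ec), ec_label_set(pred_ec)
        overlap = len(te & pe)
        precision += overlap / len(pe) if pe else 0.0
        recall += overlap / len(te)
    return precision / m, recall / m


# Hypothetical example: one partially annotated enzyme, one non-enzyme.
precision, recall = example_based_precision_recall(
    ["3.4.21.4", "Non-enzyme"], ["3.4.-.-", "Non-enzyme"]
)
```

Under this reading, a correct but shallow prediction such as 3.4.-.- keeps precision at 1 while lowering recall, which matches the behaviour expected of a conservative top-down predictor.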
Enzyme predictions from large-scale datasets
“Sprot enzyme” was used as the training dataset, and “Specific enzyme domain signature” was used to select enzyme candidates. “TrEMBL enzyme” and “HMP enzyme”, combined with their original annotations, were used to evaluate the reliability of DomSign for expanding enzyme space. All TrEMBL and HMP proteins were then annotated by DomSign to test the extent of the enzyme expansion. Further, to show the significance of enzyme expansion in KEGG, among the predicted novel enzymes of TrEMBL, novel enzymes for 2,584 bacterial genomes in KEGG were extracted. Owing to the subtle differences between KEGG and TrEMBL annotations, a few novel enzymes in TrEMBL have EC numbers in KEGG. These were removed to retrieve the exact number of novel enzymes from KEGG, and the relevant statistics were calculated.
Optimization of the DomSign specificity threshold
We tested the reliability of DomSign as an EC number prediction tool. Because we designed a parameter “specificity threshold” (Methods) in DomSign to balance the tradeoff between precision and recall (Figure 2), three rounds of 1,000-fold cross-validations (“identity ≤ 100%”, “identity ≤ 60%” and “identity ≤ 30%” cutoffs as described in Methods) were performed on the “sprot protein” dataset using DomSign with 99%, 90%, 80% and 70% specificity thresholds to optimize this parameter (Additional file 4). Among the 99%, 90% and 80% specificity thresholds, the 80% threshold had the best coverage (IA, E and IM) with only a slightly increased error rate (OP). However, further reduction of the specificity threshold to 70% resulted in a much smaller increase in coverage accompanied by a relatively severe OP ratio, especially for the “identity ≤ 30%” group, indicating that 80% might be the optimal specificity threshold for DomSign. Thus, we applied this parameter in further analyses.
Comparisons among DomSign, BLAST and FS models
BLAST was selected as the benchmark because of its wide application in research, and we used the best hit of BLAST to assign EC numbers. The FS model applies similar DS definitions, with no consideration for recurrence or order. However, it considers the contributions of every subset of DSs rather than regarding them as intact labels. Briefly, this model utilizes Bayesian statistical methods to evaluate the possibility of one particular GO annotation term inferred from all the subsets of the DS. By averaging the contributions of all the subsets, the probability of one protein having this annotation term can be calculated accordingly. There are three reasons for the comparison with the FS model. First, it utilizes domain information to assign GO terms; thus, it can act as a good benchmark among the domain architecture–based methods. Second, this method yields reliable GO assignments, even in the situation where UniRef50 is applied for cross-validation, indicating the performance stability in an unbiased condition. Finally, the FS model provides a very user-friendly package for command line usage. Here, we converted GO terms to EC numbers using the EC2GO mapping file provided by the GO consortium.
Similar to the last section, to compare the performance among DomSign, BLAST and FS models, especially when the database contained no sequences having high similarities to the query protein, three rounds of 1,000-fold cross-validations (“identity ≤ 100%”, “identity ≤ 60%” and “identity ≤ 30%” as described in Methods) were conducted on the “sprot protein” dataset by DomSign with an 80% specificity threshold, BLASTP with an E-value cutoff of 10^-3 and the FS model with default parameter settings. It is necessary to emphasize the importance of performance tests using this scenario because BLAST itself performs enzyme functional annotations well (above 90% precision and recall in some situations) when homologs with similarities above a particular threshold are available. Thus, there is limited room for further improvement in this regard, whereas there is ample need for improvement when homologs are unavailable. With the accumulation of novel sequences, this issue is expected to become more important. Thus, in the development of a new generation of computational approaches, more attention should be paid to the “homolog unavailable scenario”. As shown in Figure 3, machine learning–based methods, such as DomSign and the FS model, are much more robust than BLAST when homolog availability is reduced. Meanwhile, with a significant increase in “No best hit” (Figure 3B), coverage for BLAST decreases dramatically. Hence, in contrast to the nearly perfect performance of BLAST in the “identity ≤ 100%” group, DomSign achieved an overall performance superior to BLAST in the case of “identity ≤ 30%”, producing a comparable OP ratio but much higher coverage.
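The GO-to-EC conversion, together with the rule from Methods that more than one predicted EC number counts as “OP” in a single-EC dataset, can be sketched as follows. The mapping is passed in as a plain dictionary with placeholder GO identifiers; parsing of the actual EC2GO file is not shown, and how partial EC numbers or parent and child GO terms were handled is not specified in the text and is not addressed here.

```python
def go_terms_to_ec(predicted_go_terms, go_to_ec):
    """Collect every EC number associated with any predicted GO term."""
    ec_numbers = set()
    for go_id in predicted_go_terms:
        ec_numbers.update(go_to_ec.get(go_id, ()))
    return ec_numbers


def is_multi_ec_overprediction(predicted_go_terms, go_to_ec):
    """True when the converted prediction carries more than one EC number."""
    return len(go_terms_to_ec(predicted_go_terms, go_to_ec)) > 1


# Placeholder identifiers, not real GO terms; for illustration only.
toy_mapping = {
    "GO:toy1": ["1.1.1.1"],
    "GO:toy2": ["2.7.11.1", "2.7.11.2"],
}
single = go_terms_to_ec(["GO:toy1", "GO:unmapped"], toy_mapping)
multiple = go_terms_to_ec(["GO:toy1", "GO:toy2"], toy_mapping)
```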
Meanwhile, the FS model tended to have a very high OP ratio in all three tests, partly because of the multiple EC number predictions (Figure 3A) in this single EC enzyme plus non-enzyme dataset (Additional file 2) and partly because of incorrect EC assignments (both reasons contributed ~50% to the high OP level in the FS model, Figure 3A, B). Therefore, DomSign has the potential to partly replace BLAST as a functional annotation tool for novel proteins that have no homologs in the database.
Comparison with SVMHL using an unbiased dataset
To further test the effectiveness of DomSign with respect to avoiding potential bias towards abundant enzyme families, the “SVMHL unbiased dataset” was subjected to a 10-fold cross-validation because any two sequences have <50% identity and the enzymes are manually selected to cover most of the enzyme families without bias. The SVMHL model is the benchmark that annotates EC hierarchy by considering two main features, namely the abundance of every possible tripeptide sequence within a polypeptide and a protein structure–based enzymatic function prediction model. The annotation accuracy of DomSign and SVMHL at the second and third EC hierarchy levels is shown in Additional file 5. Although the accuracy for the SVMHL model at the second hierarchy level was slightly greater than that of DomSign, at the third hierarchy level DomSign outperformed SVMHL for most enzyme families. Because Wang et al. did not present their results at the fourth level, only the DomSign results at this level are shown (Additional file 5). Based on this comparison, DomSign works well in the unbiased situation compared with other benchmark methods.
Comparison with EnzML
Enzyme prediction in UniProt-TrEMBL and KEGG
Having demonstrated the reliability of DomSign, we annotated the whole protein space to determine if we could improve the prediction coverage of enzymes with EC numbers. UniProt-TrEMBL was used in this scenario owing to its exhaustive coverage of the known protein universe.
To test the precision of this enzyme prediction model, we ran the DomSign annotation against the “TrEMBL enzyme” set, which contained enzymes with single EC numbers in the TrEMBL database (Additional file 6). DomSign with an 80% specificity threshold yielded a 6.6% OP ratio while assigning EC numbers to ~90% of the enzymes. This OP ratio, which is higher than in previous validations, may be due to the greater degree of error in the TrEMBL annotation. This result, combined with the performance test, demonstrated that the enzyme space expansion effort we conducted, as described below, was highly reliable.
Enzyme predictions in metagenomic samples
Although millions of proteins have been discovered by the biological community, our knowledge of the protein world is still far from complete, and new metagenomic data provide us with new resources to explore. Thus, we chose the HMP dataset as a test set to expand the enzyme space for proteins identified in metagenomic datasets using DomSign. Additionally, the combinational annotation pipeline used in HMP, which relies on BLAST, TIGRFAM and Pfam-A, would be expected to be a good benchmark against which to compare DomSign in the functional annotation of metagenomic sequences.
As with TrEMBL, we first applied DomSign enzyme prediction to the “HMP enzyme” set to assess DomSign’s ability to predict enzymes. Compared with previous tests, a much higher OP ratio (9.2%) was observed for DomSign with an 80% specificity threshold (Additional file 10). Although the reliability of HMP annotations could not be evaluated in this analysis, the quality of automatic HMP annotations, as with other automatically annotated protein datasets with high error rates such as TrEMBL, is probably not as high as that of a manually curated set like Swiss-Prot. Thus, HMP annotation errors partly explain this abnormally high OP ratio, which is strongly supported by the fact that the OP ratio reached 5.4% even for DomSign with a 99% specificity threshold. These results still support the hypothesis that the reliability of the DomSign-based enzyme space expansion in HMP metagenomic datasets is acceptable.
Limitations of DomSign
In this preliminary trial, our method performed well under diverse conditions, including having only distantly related sequences in the reference database (“sprot enzyme identity ≤ 30%”) and a query set without bias towards rich enzyme families (“SVMHL unbiased dataset”), indicating its potential to predict enzyme EC numbers in large-scale datasets. However, the precision and recall of this method are still not perfect.
First, even DomSign with a 99% specificity threshold results in a 3.6% OP ratio in the “identity ≤ 30%” 1,000-fold cross-validation. This is mainly because the domain architecture is unable to fully encode enzymatic activity, especially substrate specificity [38,39]. Substrate specificity determination is complex, especially for some superfamilies with diverse catalytic functions, and thus much effort has been devoted to this task using pioneering methods such as determining key functional residues in enzymes, key-residue 3D templates and substrate de novo docking. Future work will likely include the integration of these methodologies into our pipeline to more precisely predict the substrate specificity–determining fourth EC digit. With the development of DS databases, we can further increase the resolution of our method by involving more unique protein signatures, such as those from InterPro. By this means, further increases in performance can be expected without changing the basic workflow of our method.
The comparison with SVMHL revealed variability in the performance of predicting EC numbers among different enzyme families. This corroborated a previous report that the worst result was obtained for oxidoreductases, as we observed with DomSign. A possible solution is to utilize a combinational approach because different methodologies have diverse strengths for annotating specific enzyme families. SVMHL captures the sequence-function relationship of oxidoreductases quite well using triad abundance and structure. Finally, as suggested by the comparison with EnzML, DomSign tends to have a high IA rate because it incorrectly predicts enzymes as non-enzymes. Considering that DomSign uses a very strict “yes or no” methodology to classify non-enzymes and enzymes at the first step in the pipeline, it could be improved by applying a probabilistic approach, such as the “specificity” used in the later EC number prediction step of DomSign.
Perspective expansion of enzyme space
To our knowledge, our present study represents the first systematic attempt to determine the extent to which the coverage of enzyme annotation by EC numbers could be improved, with acceptable precision, by methods beyond simple BLAST. By trying to close the gap between available EC-tagged enzymes in current databases and the real number of enzymes working in organisms, we showed that the quantity of EC-tagged enzymes can be significantly improved with high precision using relatively simple but reliable tools, such as DomSign, whether the sample is genomic or metagenomic. A series of assessments was performed to test the ability of DomSign to expand the enzyme space in large-scale protein datasets. This included a performance comparison with other benchmark enzyme annotation methods (Figures 3 and 4; Additional file 5) and a prediction and result comparison using large-scale protein sets whose members had already been assigned EC numbers, such as TrEMBL (Additional file 6) and HMP (Additional file 10). Under all conditions, the precision rate was >90% and recall was quite remarkable.
The results of the first large-scale critical assessment of protein function annotations (CAFA) were recently published. One of the main conclusions of CAFA was that many advanced methods for protein function annotation are superior to the first generation of methods, such as BLAST. Most of the top-ranked methods in CAFA utilized a machine learning–based computational approach. As suggested by Furnham N et al., however, first-generation annotation methods are still used in most research. For instance, in a previous version of SEED, an intensively used comparative genomics environment, homology-based functional transfer is the main method of annotation. This is also true for UniProt. In recent releases, UniProt incorporated the HAMAP system, and SEED complements its annotation strategy using a k-mer-based subsystem and FIGfam recognition approach; still, these approaches depend on sequence similarity–based function transfers, such as functionally homologous family profiles. The situation is essentially the same for benchmark metagenomic projects such as HMP [24,47]. With the development of metagenomics, many more sequences will be derived from environmental samples and will be novel compared with the current databases. In such cases, as shown in our work and that of many others [11,13], similarity-based function transfer will struggle to achieve the desired performance.
As our work demonstrates, there is still a need to improve the ability to predict more enzymes using in silico methods. Only 12% of the proteins in UniProt have EC numbers. In the HMP phase I 95% non-redundant set, this value is 13% (Figure 6). All of the values are far below the average 30% enzyme ratio of the nine intensively studied organisms. We believe that a richer annotated sequence resource will result once this gap is closed using a hierarchical or top-down machine-learning method. This will allow researchers to not only study many important biological questions such as orphan enzyme gene identification and metabolism network reconstruction but also improve strategies used in biotechnology, including secondary metabolism gene cluster identification, artificial biosynthesis pathway design, novel enzyme mining and metabolic engineering.
In this work, we developed a novel enzyme EC number prediction tool, DomSign, which outperforms conventional BLAST when a homolog is unavailable. In addition, we benchmarked DomSign against other recent, high-performing enzyme functional annotation tools, which confirmed that its annotation performance is superior or competitive. The DomSign method requires only amino acid sequences, with no need for existing annotations or structures. Based on the test results, the performance of DomSign could be further improved by incorporating more exhaustive protein signatures, such as substrate specificity-determining residues, and by revising the pipeline to select enzyme candidates using a probabilistic approach.
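To make the idea of a domain signature-based, top-down EC assignment concrete, the following is a minimal sketch and not the DomSign implementation itself. It assumes that each training protein is described by a set of domain identifiers and one EC number, and that an EC digit is assigned only while a single value is supported by at least a chosen fraction of the training proteins sharing the query's signature. The function names, the 0.8 threshold and the domain identifiers (`DOM_A`, `DOM_B`, `DOM_C`) are hypothetical and chosen for illustration only.

```python
from collections import Counter, defaultdict


def build_signature_index(training):
    """Group EC numbers by domain signature (a frozenset of domain IDs)."""
    index = defaultdict(list)
    for domains, ec in training:
        # Store the EC number as a tuple of digits, dropping unspecified ones ("-").
        digits = tuple(d for d in ec.split(".") if d != "-")
        index[frozenset(domains)].append(digits)
    return index


def predict_ec(domains, index, threshold=0.8):
    """Assign EC digits from the first to the fourth, stopping at the first
    level where no digit is supported by at least `threshold` of the
    training proteins that share the query's domain signature."""
    ecs = index.get(frozenset(domains))
    if not ecs:
        return None  # signature never seen in training: no prediction
    total = len(ecs)
    accepted = ()
    for level in range(1, 5):
        # Count candidate prefixes that extend the prefix accepted so far.
        counts = Counter(
            digits[:level]
            for digits in ecs
            if len(digits) >= level and digits[:level - 1] == accepted
        )
        if not counts:
            break
        best, n = counts.most_common(1)[0]
        if n / total < threshold:
            break
        accepted = best
    if not accepted:
        return None  # not even the first digit is well supported
    # Pad unresolved digits with "-" as in partial EC numbers.
    return ".".join(accepted + ("-",) * (4 - len(accepted)))


# Hypothetical training data: (domain signature, EC number).
training = [
    ({"DOM_A"}, "1.1.1.100"),
    ({"DOM_A"}, "1.1.1.100"),
    ({"DOM_A"}, "1.1.1.100"),
    ({"DOM_A"}, "1.1.1.35"),
    ({"DOM_A"}, "1.1.1.35"),
    ({"DOM_B", "DOM_C"}, "2.7.11.1"),
    ({"DOM_B", "DOM_C"}, "2.7.11.1"),
]
index = build_signature_index(training)
print(predict_ec({"DOM_A"}, index))           # first three digits agree, fourth does not
print(predict_ec({"DOM_B", "DOM_C"}, index))  # all four digits agree
print(predict_ec({"DOM_Z"}, index))           # unseen signature
```

In this sketch, a query is annotated only as deeply as the training data support, so a signature shared by enzymes that differ in substrate specificity receives a partial EC number rather than a possibly wrong full one. Raising the threshold trades annotation depth for reliability.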
Using DomSign, we examined whether a large number of ‘hidden enzymes’ without EC number annotations exist in current protein databases, such as TrEMBL and KEGG, and in metagenomic sets such as HMP. Our results preliminarily confirmed this hypothesis by significantly increasing the ratio of EC-tagged enzymes in these databases. The identification and annotation of these enzymes should significantly deepen our understanding of the metabolisms of diverse organisms or consortia, and also facilitate bioengineering by providing a richer enzyme resource. Furthermore, our results highlight the need to use more advanced tools than BLAST in protein database annotation, thereby extracting more biological information from the available biological sequences.
We are grateful to Mr. Koichi Higashi and Dr. Masaaki Kotera from Tokyo Institute of Technology for their critical reading of the manuscript and constructive feedback. This work was supported by JSPS KAKENHI (Grant number 25710016) and the CSC Postgraduate Scholarship Program (201306210186).
- Friedberg I. Automated protein function prediction–the genomic challenge. Brief Bioinform. 2006;7:225–42.
- Pitkänen E, Rousu J, Ukkonen E. Computational methods for metabolic reconstruction. Curr Opin Biotechnol. 2010;21:70–7.
- Roy A, Yang J, Zhang Y. COFACTOR: an accurate comparative algorithm for structure-based protein function annotation. Nucleic Acids Res. 2012;40(Web Server issue):W471–7.
- Lee DA, Rentzsch R, Orengo C. GeMMA: functional subfamily classification within superfamilies of predicted protein structural domains. Nucleic Acids Res. 2010;38:720–37.
- Gaudet P, Livstone MS, Lewis SE, Thomas PD. Phylogenetic-based propagation of functional annotations within the Gene Ontology consortium. Brief Bioinform. 2011;12:449–62.
- Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, et al. STRING 8–a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res. 2009;37(Database issue):D412–6.
- Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, et al. A large-scale evaluation of computational protein function prediction. Nat Methods. 2013;10:221–7.
- Yu C, Zavaljevski N, Desai V, Reifman J. Genome-wide enzyme annotation with precision control: catalytic families (CatFam) databases. Proteins. 2009;74:449–60.
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402.
- Furnham N, Garavelli JS, Apweiler R, Thornton JM. Missing in action: enzyme functional annotations in biological databases. Nat Chem Biol. 2009;5:521–5.
- Rost B. Enzyme function less conserved than anticipated. J Mol Biol. 2002;318:595–608.
- Addou S, Rentzsch R, Lee D, Orengo CA. Domain-based and family-specific sequence identity thresholds increase the levels of reliable protein function transfer. J Mol Biol. 2009;387:416–30.
- Hess M, Sczyrba A, Egan R, Kim T-W, Chokhawala H, Schroth G, et al. Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science. 2011;331:463–7.
- Todd AE, Orengo CA, Thornton JM. Evolution of function in protein superfamilies, from a structural perspective. J Mol Biol. 2001;307:1113–43.
- Shen H-B, Chou K-C. EzyPred: a top-down approach for predicting enzyme functional classes and subclasses. Biochem Biophys Res Commun. 2007;364:53–9.
- Akiva E, Brown S, Almonacid DE, Barber AE, Custer AF, Hicks MA, et al. The structure-function linkage database. Nucleic Acids Res. 2014;42(Database issue):D521–30.
- Schnoes AM, Brown SD, Dodevski I, Babbitt PC. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol. 2009;5:e1000605.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat Genet. 2000;25:25–9.
- Forslund K, Sonnhammer ELL. Predicting protein function from domain content. Bioinformatics. 2008;24:1681–7.
- Fang H, Gough J. DcGO: database of domain-centric ontologies on functions, phenotypes, diseases and more. Nucleic Acids Res. 2013;41(Database issue):D536–44.
- Rentzsch R, Orengo CA. Protein function prediction–the power of multiplicity. Trends Biotechnol. 2009;27:210–9.
- The UniProt Consortium. Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2014;42(Database issue):D191–8.
- Kanehisa M. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30.
- The Human Microbiome Project Consortium. A framework for human microbiome research. Nature. 2012;486:215–21.
- Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, et al. The Pfam protein families database. Nucleic Acids Res. 2012;40(Database issue):D290–301.
- Messih MA, Chitale M, Bajic VB, Kihara D, Gao X. Protein domain recurrence and order can enhance prediction of protein functions. Bioinformatics. 2012;28:i444–50.
- Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol. 2011;7:e1002195.
- Hill DP, Davis AP, Richardson JE, Corradi JP, Ringwald M, Eppig JT, et al. Program description: strategies for biological annotation of mammalian systems: implementing gene ontologies in mouse genome informatics. Genomics. 2001;74:121–8.
- Wang Y-C, Wang Y, Yang Z-X, Deng N-Y. Support vector machine prediction of enzyme function with conjoint triad feature and hierarchical context. BMC Syst Biol. 2011;5 Suppl 1:S6.
- De Ferrari L, Aitken S, van Hemert J, Goryanin I. EnzML: multi-label prediction of enzyme classes using InterPro signatures. BMC Bioinformatics. 2012;13:61.
- Tsoumakas G, Katakis I, Vlahavas I. Data Mining and Knowledge Discovery Handbook. 2010(Mlc).
- Chou K-C. Some remarks on protein attribute prediction and pseudo amino acid composition. J Theor Biol. 2011;273:236–47.
- Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, et al. Predicting protein-protein interactions based only on sequences information. Proc Natl Acad Sci U S A. 2007;104:4337–41.
- Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, et al. InterPro: the integrative protein signature database. Nucleic Acids Res. 2009;37(Database issue):D211–5.
- Tsoumakas G, Spyromitros-Xioufis E, Vilcek J, Vlahavas I. MULAN: a java library for multi-label learning. J Mach Learn Res. 2011;12:2411–4.
- Desai DK, Nandi S, Srivastava PK, Lynn AM. ModEnzA: accurate identification of metabolic enzymes using function specific profile HMMs with optimised discrimination threshold and modified emission probabilities. Adv Bioinformatics. 2011;2011:743782.
- Kumar N, Skolnick J. EFICAz2.5: application of a high-precision enzyme function predictor to 396 proteomes. Bioinformatics. 2012;28:2687–8.
- Bashton M, Thornton JM. Domain-ligand mapping for enzymes. J Mol Recognit. 2009;23:194–208.
- Brown SD, Gerlt JA, Seffernick JL, Babbitt PC. A gold standard set of mechanistically diverse enzyme superfamilies. Genome Biol. 2006;7:R8.
- Rodriguez GJ, Yao R, Lichtarge O, Wensel TG. Evolution-guided discovery and recoding of allosteric pathway specificity determinants in psychoactive bioamine receptors. Proc Natl Acad Sci U S A. 2010;107:7787–92.
- Nagao C, Nagano N, Mizuguchi K. Relationships between functional subclasses and information contained in active-site and ligand-binding residues in diverse superfamilies. Proteins. 2010;78:2369–84.
- Arakaki AK, Huang Y, Skolnick J. EFICAz2: enzyme function inference by a combined approach enhanced by machine learning. BMC Bioinformatics. 2009;10:107.
- Amin SR, Erdin S, Ward RM, Lua RC, Lichtarge O. Prediction and experimental validation of enzyme substrate specificity in protein structures. Proc Natl Acad Sci U S A. 2013;110:E4195–202.
- Zhao S, Kumar R, Sakai A, Vetting MW, Wood BM, Brown S, et al. Discovery of new enzymes and metabolic pathways by using structure and genome context. Nature. 2013;502:698–702.
- Pedruzzi I, Rivoire C, Auchincloss AH, Coudert E, Keller G, de Castro E, et al. HAMAP in 2013, new developments in the protein family classification and annotation system. Nucleic Acids Res. 2013;41(Database issue):D584–9.
- Overbeek R, Olson R, Pusch GD, Olsen GJ, Davis JJ, Disz T, et al. The SEED and the rapid annotation of microbial genomes using subsystems technology (RAST). Nucleic Acids Res. 2014;42(Database issue):D206–14.
- Tanenbaum DM, Goll J, Murphy S, Kumar P, Zafar N, Thiagarajan M, et al. The JCVI standard operating procedure for annotating prokaryotic metagenomic shotgun sequencing data. Stand Genomic Sci. 2010;2:229–37.
- Quester S, Schomburg D. EnzymeDetector: an integrated enzyme function prediction tool and database. BMC Bioinformatics. 2011;12:376.
- Yamada T, Waller AS, Raes J, Zelezniak A, Perchat N, Perret A, et al. Prediction and identification of sequences coding for orphan enzymes using genomic and metagenomic neighbours. Mol Syst Biol. 2012;8:581.
- Orth JD, Conrad TM, Na J, Lerman JA, Nam H, Feist AM, et al. A comprehensive genome-scale reconstruction of Escherichia coli metabolism–2011. Mol Syst Biol. 2011;7:535.
- Medema MH, Blin K, Cimermancic P, De Jager V, Zakrzewski P, Fischbach MA, et al. AntiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. Nucleic Acids Res. 2011;39(Web Server issue):W339–46.
- Carbonell P, Parutto P, Herisson J, Pandit SB, Faulon J-L. XTMS: pathway design in an eXTended metabolic space. Nucleic Acids Res. 2014;42(Web Server issue):W389–94.
- Schallmey M, Koopmeiners J, Wells E, Wardenga R, Schallmey A. Expanding the halohydrin dehalogenase enzyme family: identification of novel enzymes by database mining. Appl Environ Microbiol. 2014;80:7303–15.
- Ro D-K, Paradise EM, Ouellet M, Fisher KJ, Newman KL, Ndungu JM, et al. Production of the antimalarial drug precursor artemisinic acid in engineered yeast. Nature. 2006;440:940–3.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.