Homology to peptide pattern for annotation of carbohydrate-active enzymes and prediction of function
© The Author(s). 2017
Received: 20 December 2016
Accepted: 5 April 2017
Published: 12 April 2017
Carbohydrate-active enzymes are found in all organisms and participate in key biological processes. These enzymes are classified in 274 families in the CAZy database, but the sequence diversity within each family makes it a major task to identify new family members and to provide a basis for prediction of enzyme function. A fast and reliable method for de novo annotation of genes encoding carbohydrate-active enzymes is to identify conserved peptides in the curated enzyme families and then match the conserved peptides to the sequence of interest, as demonstrated for the glycosyl hydrolase and the lytic polysaccharide monooxygenase families. This approach not only assigns the enzymes to families but also provides functional prediction of the enzymes with high accuracy.
We identified conserved peptides for all enzyme families in the CAZy database with Peptide Pattern Recognition. The conserved peptides were matched to protein sequences for de novo annotation and functional prediction of carbohydrate-active enzymes with the Hotpep method. Annotation of protein sequences from 12 bacterial and 16 fungal genomes to families with Hotpep had an accuracy of 0.84 (measured as F1 score) compared to semiautomatic annotation by the CAZy database, whereas the dbCAN HMM-based method had an accuracy of 0.77 with optimized parameters. Furthermore, Hotpep provided a functional prediction with 86% accuracy for the annotated genes. Hotpep is available as a stand-alone application for MS Windows.
Hotpep is a state-of-the-art method for automatic annotation and functional prediction of carbohydrate-active enzymes.
Keywords: Carbohydrate-active enzymes, Genomics, Annotation, Software
Carbohydrate-active enzymes are produced by all organisms to accomplish enzymatic modification of carbohydrate-containing compounds both intra- and extracellularly. Hence, this enzyme group is relevant for understanding central biological processes such as sugar metabolism, protein glycosylation and, on an ecological level, for global biomass synthesis and degradation. It is not surprising that carbohydrate-active enzymes are used in medical and industrial biotechnology. The CAZy database (http://www.cazy.org/) was founded in 1991 and contains a unique classification of carbohydrate-active enzymes including carefully curated information about enzyme sequence, structure and function. Currently, the publicly available information in the CAZy database consists of almost 400,000 unique protein sequences classified in more than 300 families.
Despite the abundant information in the CAZy database, de novo annotation of carbohydrate-active enzymes is not a trivial task. State-of-the-art methods involve automatic identification by matching the sequences of interest to protein models generated directly from sequences in the CAZy database or indirectly from protein domain models from other databases or by BLAST search followed by manual curation of the data [1–4].
Entirely automatic annotation methods have been developed based on hidden Markov model (HMM) recognition of all or a subset of the enzymes in the CAZy database and are available as web-based services [5–7]. For example, the dbCAN method was made by refining HMMs from the Conserved Domain Database to fit the families in the CAZy database and supplementing them with new HMMs for the families in the CAZy database that are not modelled in the Conserved Domain Database.
Even when it is possible to annotate a protein to a specific family, this does not necessarily allow an exact prediction of its enzymatic activity. This is because the classification of the carbohydrate-active enzymes in the CAZy database is based on protein sequence and structure similarity. Thus, in many cases the classification does not reflect enzymatic activity. Hence, proteins with identical enzymatic activity are classified in different families and most of the families contain proteins with different enzymatic activities.
Identification of short, conserved motifs can be used to group related protein sequences and will often pinpoint proteins with the same enzymatic activity [8, 9]. Furthermore, the method Homology to Peptide Pattern (Hotpep) matches the short, conserved motifs to undescribed protein sequences to obtain a fast, sensitive and precise annotation of carbohydrate-active enzymes to families. Moreover, when experimental data on enzymatic activity are available, Hotpep allows prediction of the enzymatic activity of the proteins. In practice, the experimental data on enzyme activity collected in the CAZy database can be used to predict the enzymatic activity of approximately 75% of the carbohydrate-active enzymes in a genome with 80% accuracy [9, 10].
We used the method Peptide Pattern Recognition (PPR) to identify short, conserved sequence motifs for all enzyme families in the CAZy database. The peptide patterns were combined with Hotpep to obtain stand-alone software for automatic annotation and functional prediction of carbohydrate-active enzymes. To illustrate the approach, protein sequences from 12 bacterial and 16 fungal genomes were annotated. Hotpep had an F1 score of 0.86 (sensitivity = 0.88, precision = 0.84) for predicting carbohydrate-active enzymes in 12 bacterial genomes and an F1 score of 0.82 (sensitivity = 0.77, precision = 0.88) for predicting carbohydrate-active enzymes in 16 fungal genomes compared to semiautomatic annotation by the CAZy database tools for carbohydrate-active enzyme annotation [1, 4]. Moreover, Hotpep correctly predicted the activity of 86% of the characterized carbohydrate-active enzymes in the CAZy database.
The carbohydrate binding modules (CBM) are not defined as carbohydrate-active enzymes per se but are carbohydrate binding domains within multidomain carbohydrate-active enzymes. Using short, conserved peptides for the CBM families in the CAZy database, Hotpep annotates the CBMs with an F1 score of 0.87.
The Hotpep stand-alone application is available for download from Sourceforge for use on desktop computers with the MS Windows operating system.
The first step was to download sequences for all members of each carbohydrate-active enzyme family in the CAZy database (www.cazy.org) from Genbank (https://www.ncbi.nlm.nih.gov/) in August 2016. The CBM families were downloaded in February 2017. Sequences that were 100% redundant or 100% identical to a part of another sequence were removed.
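The redundancy filter described above can be illustrated with a minimal Python sketch. It is not the code used for the study: the function name `remove_redundant`, the toy accessions and the toy sequences are all hypothetical, and the simple pairwise substring test shown here would be slow on several hundred thousand sequences.

```python
def remove_redundant(sequences):
    """Drop sequences identical to, or fully contained in, another sequence.

    `sequences` maps a (hypothetical) accession to its amino acid sequence.
    """
    # Longest first, so a sequence can only be contained in one already kept.
    ordered = sorted(sequences.items(), key=lambda item: (-len(item[1]), item[0]))
    kept = {}
    for accession, seq in ordered:
        if any(seq in longer for longer in kept.values()):
            continue  # exact duplicate or fragment of a kept sequence
        kept[accession] = seq
    return kept


# Toy input, invented for illustration only.
example = {
    "A1": "MKTAYIAKQRQISFVK",
    "A2": "MKTAYIAKQRQISFVK",  # exact duplicate of A1
    "A3": "AYIAKQRQ",          # fragment of A1
    "A4": "MSHHWGYGKHNGPEHW",
}
print(sorted(remove_redundant(example)))  # ['A1', 'A4']
```

Sorting by decreasing length means each sequence only has to be compared against the sequences already kept.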
Identification of short, conserved peptides
PPR was used for identification of short, conserved peptides in each family of carbohydrate-active enzymes as previously described [9, 10, 13]. Briefly, for each family PPR found the largest group of proteins that contained at least 10 of 70 conserved hexamer peptides. The length of the conserved peptides (hexamers), the number of conserved peptides per protein (10) and the total number of conserved peptides per group (70) were chosen as they were the conditions that gave the best rate of prediction of protein function in empirical testing of peptide lengths from trimers to decamers, 5 – 40 conserved peptides per protein and 30 – 200 conserved peptides per group. Moreover, the minimum frequency of each conserved peptide in a group was 0.20 as this threshold gives the best rate of prediction of protein function. For CBM domains the parameters 30 conserved hexapeptides per PPR group and 3 conserved peptides per protein were used for PPR analysis.
The first group of proteins identified by this method was named group 1. Next, PPR found the second largest group of proteins, not including any proteins from group 1. This group of proteins was named group 2 and so on. The analysis was stopped when fewer than five proteins were grouped together.
In this way a number of groups consisting of a list of protein sequences and a list of conserved peptides were generated for each family in the CAZy database. Groups including proteins with a described enzyme activity as reported in the CAZy database were assigned the same function as the enzymes as previously described.
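To make the notion of a conserved peptide list concrete, the following Python sketch ranks the hexamers of one already-defined group of proteins. It does not implement the iterative PPR grouping itself. It assumes that the frequency of a peptide is the fraction of proteins in the group that contain it, and the function name `conserved_hexamers` and the toy sequences are hypothetical.

```python
from collections import Counter


def conserved_hexamers(group, k=6, max_peptides=70, min_frequency=0.20):
    """Return (peptide, frequency) pairs for one already-defined group.

    Frequency is assumed to be the fraction of the group's proteins
    that contain the peptide at least once.
    """
    counts = Counter()
    for seq in group:
        # A set, so that each protein counts a given peptide at most once.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    scored = [(pep, n / len(group)) for pep, n in counts.items()
              if n / len(group) >= min_frequency]
    # Most frequent first; ties broken alphabetically for reproducibility.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:max_peptides]


# Toy group, invented for illustration only.
toy_group = ["MGHWGYGA", "TGHWGYGS", "QGHWGYGK", "PPPPPPPP"]
print(conserved_hexamers(toy_group, min_frequency=0.5))  # [('GHWGYG', 0.75)]
```

The defaults mirror the parameters quoted in the text (hexamers, 70 peptides per group, minimum frequency 0.20); the example raises the frequency threshold only to keep the toy output short.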
For AA families 9, 10 and 11 the conserved peptide lists of the previously described expanded families were used.
Bacterial strains and accession numbers
Bacteroides cellulosilyticus WH2 (gut and stomach)
Caldicellulosiruptor saccharolyticus DSM8903
Deinococcus peraridilitoris DSM19664
Desulfotomaculum gibsoniae DSM7213
Enterobacter lignolyticus SCF1 (tropical forest soil)
Melioribacter roseus P3M-2 (wooden surface of a chute)
Prevotella ruminicola 23
Rhodococcus jostii RHA1
Ruminiclostridium thermocellum ATCC27405
Teredinibacter turnerae T7901 (intracellular in shipworm)
Thermacetogenium phaeum DSM12270 (thermophilic anaerobic methanogenic reactor)
Thermoanaerobacterium thermosaccharolyticum DSM571
Fungal strains (basidiomycotae) and accession numbers
Annotation with Hotpep
Each protein sequence was compared to the list of conserved peptides of each PPR group by:
1. Finding all the conserved peptides from the list that were present in the sequence.
2. Summing the frequencies of these peptides to obtain the group-specific frequency score.
A protein was considered a hit to a PPR group if:
1. It included three or more conserved peptides from the group.
2. The frequency score for the peptides was higher than 1.0.
3. The conserved peptides represented at least ten amino acids of the protein sequence.
If a protein satisfied all three conditions it was assigned to the family and to the PPR group with the highest group-specific frequency score. Moreover, if this group had been assigned a function by the PPR analysis, the same function was predicted for the protein.
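The scoring and the three conditions above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Hotpep source code (which is written in Ruby): the function names, the group names and the toy peptides are hypothetical, and "at least ten amino acids" is interpreted here as the number of distinct sequence positions covered by matched peptides.

```python
def score_against_group(sequence, peptide_freqs):
    """Score one protein against one group's conserved peptides.

    `peptide_freqs` maps each conserved peptide to its frequency in the group.
    Returns (peptides found, summed frequency, residues covered).
    """
    hits = 0
    frequency_score = 0.0
    covered = set()  # sequence positions covered by at least one peptide
    for peptide, freq in peptide_freqs.items():
        start = sequence.find(peptide)
        if start == -1:
            continue
        hits += 1
        frequency_score += freq
        while start != -1:  # mark every occurrence, including overlapping ones
            covered.update(range(start, start + len(peptide)))
            start = sequence.find(peptide, start + 1)
    return hits, frequency_score, len(covered)


def annotate(sequence, groups, min_hits=3, min_score=1.0, min_coverage=10):
    """Return the best-scoring group that passes all three conditions, or None."""
    best_name, best_score = None, 0.0
    for name, peptide_freqs in groups.items():
        hits, score, coverage = score_against_group(sequence, peptide_freqs)
        if hits >= min_hits and score > min_score and coverage >= min_coverage:
            if score > best_score:
                best_name, best_score = name, score
    return best_name


# Toy groups and protein, invented for illustration only.
groups = {
    "GH5_group1": {"GHWGYG": 0.9, "WGYGKH": 0.6, "NEPRGA": 0.5},
    "GH5_group2": {"GHWGYG": 0.4, "TTTTTT": 0.8, "NEPRGA": 0.3},
}
protein = "MKLGHWGYGKHSSNEPRGAQ"
print(annotate(protein, groups))  # GH5_group1
```

In the toy example the protein contains all three peptides of the first group but only two peptides of the second group, so only the first group passes the three-peptide condition.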
Hotpep including the conserved peptide patterns described here is available for download as an application for the MS Windows operating system from Sourceforge.
Annotation with dbCAN
The protein products from each genome were annotated de novo with the dbCAN web service for protein annotation with standard parameters and with optimized parameters (E-value < 10^-18 and coverage > 0.35 for bacteria; E-value < 10^-17 and coverage > 0.45 for fungi) by downloading scripts and HMMs as described (http://csbl.bmb.uga.edu/dbCAN/annotate.php).
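As an illustration of how such thresholds act on a list of HMM hits, the sketch below filters hits by E-value and coverage. The `Hit` record, its field names and the example hits are hypothetical and do not reflect the dbCAN output format; coverage is assumed to be the fraction of the HMM covered by the alignment.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    protein: str
    family: str
    evalue: float
    coverage: float  # assumed: fraction of the HMM covered by the alignment


# Thresholds quoted in the text for the optimized runs.
THRESHOLDS = {
    "bacteria": {"max_evalue": 1e-18, "min_coverage": 0.35},
    "fungi": {"max_evalue": 1e-17, "min_coverage": 0.45},
}


def filter_hits(hits, kingdom):
    """Keep hits below the E-value and above the coverage threshold."""
    limits = THRESHOLDS[kingdom]
    return [h for h in hits
            if h.evalue < limits["max_evalue"]
            and h.coverage > limits["min_coverage"]]


# Toy hits, invented for illustration only.
hits = [
    Hit("p1", "GH5", 1e-40, 0.90),
    Hit("p2", "CE1", 1e-10, 0.80),  # E-value too high for both settings
    Hit("p3", "GT2", 1e-30, 0.40),  # coverage passes for bacteria only
]
print([h.protein for h in filter_hits(hits, "bacteria")])  # ['p1', 'p3']
print([h.protein for h in filter_hits(hits, "fungi")])     # ['p1']
```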
The following values were calculated for pairwise comparison of two annotation methods:
True positives = Number of hits found by both screening methods. False positives = Number of proteins found by the screening method being tested but not by the reference method. False negatives = Number of proteins found by the reference method but not by the screening method being tested.
Sensitivity was calculated as True positives/(True positives + False negatives); Precision (positive prediction value) was calculated as True positives/(True positives + False positives) and F1 score (the harmonic mean of precision and sensitivity) was calculated as (2 × True positives)/(2 × True positives + False positives + False negatives).
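These definitions translate directly into code. The sketch below assumes that each annotation is available as a set of hit identifiers (for example protein accessions, or protein and family pairs); the function name `compare_annotations` and the toy identifiers are hypothetical.

```python
def compare_annotations(tested, reference):
    """Compare the hits of a tested method against a reference method."""
    tested, reference = set(tested), set(reference)
    tp = len(tested & reference)   # found by both methods
    fp = len(tested - reference)   # found by the tested method only
    fn = len(reference - tested)   # found by the reference method only
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return {"TP": tp, "FP": fp, "FN": fn,
            "sensitivity": sensitivity, "precision": precision, "F1": f1}


# Toy hit lists, invented for illustration only.
tested_hits = {"p1", "p2", "p3", "p4"}
reference_hits = {"p2", "p3", "p4", "p5", "p6"}
print(compare_annotations(tested_hits, reference_hits))
```

Because the F1 score is the harmonic mean of precision and sensitivity, it can also be checked from those two values alone: with sensitivity 0.88 and precision 0.84, 2 × 0.88 × 0.84/(0.88 + 0.84) ≈ 0.86, and with sensitivity 0.77 and precision 0.88 the same formula gives ≈ 0.82.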
Results and discussion
Short, conserved peptides identified in the carbohydrate-active enzymes from the glycoside hydrolase families in the CAZy database can be used for fast, efficient and reliable annotation by the Hotpep method. Moreover, groups of carbohydrate-active proteins sharing the same short, conserved peptides often have the same enzymatic activity. Thus, by comparing the rich information on experimentally characterized enzymes in the CAZy database with the PPR grouping of the enzymes it is possible to predict the enzymatic activity of the uncharacterized members of the groups with 80% accuracy. In this way, a functional prediction was obtained for 72% of the annotated glycoside hydrolases in 39 fungal genomes.
To accomplish automatic annotation of all carbohydrate-active enzymes with Hotpep we downloaded all sequences in the families of the five enzyme classes: Carbohydrate esterases (CE), Glycoside hydrolases (GH), Auxiliary activities (AA), Polysaccharide lyases (PL) and Glycosyl transferases (GT). A total of 594,121 accession numbers were found in the CAZy database and reduced to 380,269 non-redundant protein sequences. Each family was then sorted into groups of proteins sharing up to 70 short, conserved hexapeptides, and a function was assigned to each group containing more than two functionally characterized members (Additional file 1). In total 36% of the 5590 PPR groups for all enzyme families included functionally characterized proteins. These groups with associated functions contained 65% of the PPR-grouped proteins. For the glycoside hydrolases, 41% of the groups included functionally characterized proteins and these groups contained 74% of all proteins, in agreement with the previous report of a functional prediction of 72% of the glycoside hydrolases.
For the CBM class of carbohydrate-binding modules we found 71,253 accession numbers in the CAZy database resulting in 45,048 non-redundant protein sequences. Due to the short length of most CBM domains [7, 11] it was uncertain whether the standard parameters of 70 conserved peptides per PPR group and 10 conserved peptides per protein were optimal for annotation of CBMs. Therefore, different parameters for PPR were tested for classification of the isolated CBM domains followed by Hotpep annotation of the full-length proteins and comparison to the annotation in the CAZy database. There was little variation in the F1 score (0.83 - 0.87) within the range of tested parameters (Additional file 2) in agreement with the notion that PPR groups are fairly stable within a large range of parameters. The parameters 30 conserved peptides per PPR group and 3 conserved peptides per protein gave the highest F1 score of 0.87 and were chosen for annotation of CBMs.
Hotpep annotates proteins by matching the lists of conserved peptides of a group to the protein sequences of interest [10, 13, 14]. Any sequence that fulfills a number of criteria (see Implementation), of which the most important is that the sequence should include at least three of the conserved peptides, will be annotated to the protein group. We combined Hotpep with the lists of conserved peptides for all enzyme families in the CAZy database into an application that can identify members of all carbohydrate-active enzyme families and CBMs. The AA9, AA10 and AA11 conserved peptides were substituted with the AA9exp, AA10exp and AA11exp conserved peptides that represent a more complete description of the sequence variation in these families. The complete lists of peptides and frequencies are available for download at Sourceforge together with the accession numbers of the sequences for each group and the library of EC functional scores for each group.
This method correctly predicts 80 – 95% of enzyme activities [9, 10]. To test this further, we used Hotpep to predict the function of 8812 experimentally characterized carbohydrate-active enzymes (Additional file 3). Hotpep correctly predicted the function of 86% of the enzymes. This result supports the previous finding that proteins sharing conserved peptides often but not always have the same activity. Hence, enzymatic activities for individual sequences predicted by Hotpep should be used as a guideline for functional characterization. In an analysis of annotation of glycosyl hydrolases from ORFs in genome fragments with Hotpep it was found that the glycosyl hydrolases that were overlooked by Hotpep could be detected when the full-length amino acid sequences of the enzymes were used for annotation. This finding suggests that more true positive hits are obtained by examining full-length coding regions rather than ORFs containing single exons. To test this notion we compared the annotation of all carbohydrate-active enzymes in seven fungal genomes to annotation of predicted proteins from the same genomes. The fungi were selected to include genome assemblies and predicted proteins from different research groups to avoid methodological bias. The results showed that 31% more carbohydrate-active enzymes were found by annotation of the predicted proteins from the genomes compared to annotation of ORFs in fragments of the genomes (Additional file 4) in agreement with the previous report. Hence, although the exon-intron structure of eukaryotic genes makes them difficult to predict, a higher sensitivity in prediction of carbohydrate-active enzymes is obtained by annotating from predicted proteins rather than from ORFs in genome fragments.
Annotation with Hotpep of predicted proteins from 12 bacterial genomes was compared to state-of-the-art semiautomatic annotation reported in the CAZy database. The selected genomes were from bacteria with different lifestyles including bacteria known to degrade extracellular carbohydrates.
Annotation of 12 bacterial genomes
It was reported that automatic identification with the HMM signatures in dbCAN is a highly precise and sensitive method for annotation of carbohydrate-active enzymes. Annotation of the 12 bacterial genomes with the dbCAN web service (http://csbl.bmb.uga.edu/dbCAN/annotate.php) gave a higher number of hits than the annotation in the CAZy database, resulting in a sensitivity similar to Hotpep but with lower precision and F1 score (Table 3). However, annotation of the 12 bacterial genomes with the downloaded dbCAN HMMs and optimized parameters gave a lower number of hits than the annotation in the CAZy database, resulting in slightly higher sensitivity, precision and F1 score than Hotpep (Table 3). Thus, although the downloadable dbCAN is more difficult to use than the web service, as the user has to both download the dbCAN HMMs and install the HMMER 3.0 package, the extra effort pays off in the form of a more accurate annotation. In summary, the comparison of the annotation methods showed that the CAZy database, Hotpep and downloaded dbCAN were most in agreement whereas the dbCAN web service annotates a higher number of genes as encoding carbohydrate-active enzymes.
To assess the performance of Hotpep for identification of eukaryotic genes, 16 fungal genomes that have been sequenced and annotated by The Joint Genome Institute and the CAZy database tools by Hori et al. were selected for annotation. Testing on these genomes has the benefit that many of the carbohydrate-active enzymes from these fungi are not part of the CAZy database and have thus not been part of the dataset used to make the conserved peptide patterns used by Hotpep.
Annotation of 16 fungal genomes
The F1 score (0.82) for the comparison of Hotpep with Hori et al. for the 16 fungal genomes is a little lower than the F1 score (0.86) for the annotation of the 12 bacterial genomes. However, the fungal genomes were all from basidiomycetes, whose carbohydrate-active enzymes are less represented in the CAZy database than those from ascomycetes and thus may be more difficult to annotate. To assess this possibility we used previously published data to calculate the F1 score for comparison of annotation of six ascomycete genomes by Hotpep and the CAZy database tools for annotation. The few disagreements between the methods were attributed mainly to differences in gene prediction rather than to differences in annotation. In line with this notion, the F1 score for this dataset of ascomycete genes was 0.92 compared to only 0.82 for the annotation of basidiomycete genes in the present study. This finding suggests that the publicly available CAZy database may not yet account for the complete sequence variation in the carbohydrate-active enzyme families; for example, the basidiomycete sequences may be underrepresented. This is in agreement with the ongoing addition of new sequences to the CAZy database. A simple expansion of the LPMO enzyme families in the CAZy database by including previously unannotated, publicly available sequences led to the identification of the AA11 enzymes and was shown to give a better representation of the sequence variation of the families, thereby making it possible to identify 31% more LPMOs in 39 fungal genomes. The current version of Hotpep for annotation of carbohydrate-active enzymes includes the expanded conserved peptide signatures for the AA9, AA10 and AA11 families. As expanded signatures become available for other families, they will be added to Hotpep.
Hotpep could in principle be used for annotation of enzymes other than carbohydrate-active enzymes, provided that sufficiently well curated sequence databases are available.
Hotpep is an easy-to-use tool that performs automatic annotation of carbohydrate-active enzymes with a high success rate. The result of annotation with Hotpep is comparable to state-of-the-art semiautomatic annotation by experts [1, 4] and automatic annotation with HMMs. Furthermore, Hotpep also provides a prediction of function directly from the amino acid sequence.
A downloadable version of Hotpep is available as a stand-alone application that runs on the MS Windows operating system.
CBM: Carbohydrate binding module
HMM: Hidden Markov model
Hotpep: Homology to Peptide Pattern
LPMO: Lytic polysaccharide monooxygenase
PPR: Peptide Pattern Recognition
We thank Kristian Barrett for fruitful discussions on enzyme annotation and on the performance of Hotpep.
This work was supported by project no.: Mar 14319 from Nordic Innovation; SYNFERON – from the Danish Innovation Fund and by The Villum Foundation. The funding bodies did not play any role in the design of the study, in the collection, analysis, and interpretation of data or in writing the manuscript.
Availability of data and materials
Project name: Hotpep for Carbohydrate-active enzymes
Project home page: https://sourceforge.net/projects/hotpep/
Operating systems: Windows 7 or higher
Programming language: Ruby 2.2.4
License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
Any restrictions to use by non-academics: Commercial rights reserved.
PKB wrote the software, downloaded the sequences, performed the analysis necessary to develop Hotpep and performed the comparison of annotations. BP tested the Hotpep algorithm and participated in data acquisition and analysis. MJL performed the dbCAN annotation and result analysis. ASM discussed the final results and interpretation of the data. LL initiated the study and discussed the final results and interpretation of the data. The manuscript was written by the authors from a draft by PKB. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
1. Lombard V, Golaconda Ramulu H, Drula E, Coutinho PM, Henrissat B. The carbohydrate-active enzymes database (CAZy) in 2013. Nucleic Acids Res. 2014;42:D490–495.
2. Floudas D, Binder M, Riley R, Barry K, Blanchette RA, Henrissat B, et al. The Paleozoic origin of enzymatic lignin decomposition reconstructed from 31 fungal genomes. Science. 2012;336:1715–9.
3. Grigoriev IV, Martinez DA, Salamov AA. 5 - Fungal Genomic Annotation. In: Dilip K. Arora RMB and GBS, editor. Applied Mycology and Biotechnology [Internet]. Elsevier; 2006 [cited 2016 Dec 1]. p. 123–42. Available from: http://www.sciencedirect.com/science/article/pii/S1874533406800080
4. Hori C, Ishida T, Igarashi K, Samejima M, Suzuki H, Master E, et al. Analysis of the Phlebiopsis gigantea Genome, Transcriptome and Secretome Provides Insight into Its Pioneer Colonization Strategies of Wood. PLoS Genet. 2014;10:e1004759.
5. Ekstrom A, Taujale R, McGinn N, Yin Y. PlantCAZyme: a database for plant carbohydrate-active enzymes. Database (Oxford). 2014;2014:bau079.
6. Park BH, Karpinets TV, Syed MH, Leuze MR, Uberbacher EC. CAZymes Analysis Toolkit (CAT): Web service for searching and analyzing carbohydrate-active enzymes in a newly sequenced organism using CAZy database. Glycobiology. 2010;20:1574–84.
7. Yin Y, Mao X, Yang J, Chen X, Mao F, Xu Y. dbCAN: a web resource for automated carbohydrate-active enzyme annotation. Nucleic Acids Res. 2012;40:W445–51.
8. Busk PK, Lange L. A Novel Method of Providing a Library of N-Mers or Biopolymers. WO/2012/101151. [Internet]. 2012 [cited 2012 Dec 11]. Available from: http://www.freepatentsonline.com/WO2012101151A1.html
9. Busk PK, Lange L. Function-based classification of carbohydrate-active enzymes by recognition of short, conserved peptide motifs. Appl Environ Microbiol. 2013;79:3380–91.
10. Busk PK, Lange M, Pilgaard B, Lange L. Several genes encoding enzymes with the same activity are necessary for aerobic fungal degradation of cellulose in nature. PLoS One. 2014;9:e114138.
11. Boraston AB, Bolam DN, Gilbert HJ, Davies GJ. Carbohydrate-binding modules: fine-tuning polysaccharide recognition. Biochem J. 2004;382:769–81.
12. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, et al. GenBank. Nucleic Acids Res. 2013;41:D36–42.
13. Busk PK, Lange L. Classification of fungal and bacterial lytic polysaccharide monooxygenases. BMC Genomics. 2015;16:368.
14. Bech L, Busk PK, Lange L. Cell Wall Degrading Enzymes in Trichoderma asperellum Grown on Wheat Bran. Fungal Genom Biol. 2015;4:116.
15. Brent MR. How does eukaryotic gene prediction work? Nat Biotechnol. 2007;25:883–5.
16. Karlsson J, Saloheimo M, Siika-Aho M, Tenkanen M, Penttilä M, Tjerneld F. Homologous expression and characterization of Cel61A (EG IV) of Trichoderma reesei. Eur J Biochem. 2001;268:6498–507.
17. Watanabe T, Kimura K, Sumiya T, Nikaidou N, Suzuki K, Suzuki M, et al. Genetic analysis of the chitinase system of Serratia marcescens 2170. J Bacteriol. 1997;179:7111–7.
18. Hemsworth GR, Henrissat B, Davies GJ, Walton PH. Discovery and characterization of a new family of lytic polysaccharide monooxygenases. Nat Chem Biol. 2014;10:122–6.
19. Levasseur A, Drula E, Lombard V, Coutinho PM, Henrissat B. Expansion of the enzymatic repertoire of the CAZy database to integrate auxiliary redox enzymes. Biotechnol Biofuels. 2013;6:41.