The B6 database: a tool for the description and classification of vitamin B6-dependent enzymatic activities and of the corresponding protein families
BMC Bioinformatics volume 10, Article number: 273 (2009)
Enzymes that depend on vitamin B6 (and in particular on its metabolically active form, pyridoxal 5'-phosphate, PLP) are of great relevance to biology and medicine, as they catalyze a wide variety of biochemical reactions mainly involving amino acid substrates. Although PLP-dependent enzymes belong to a small number of independent evolutionary lineages, they encompass more than 160 distinct catalytic functions, thus representing a striking example of divergent evolution. The importance and remarkable versatility of these enzymes, as well as the difficulties in their functional classification, create a need for an integrated source of information about them.
The B6 database (http://bioinformatics.unipr.it/B6db) contains documented B6-dependent activities and the relevant protein families, defined as monophyletic groups of sequences possessing the same enzymatic function. One or more families were associated with each of 121 PLP-dependent activities with known sequences. Hidden Markov models (HMMs) were built from family alignments and incorporated in the database. These HMMs can be used for the functional classification of PLP-dependent enzymes in genomic sets of predicted protein sequences. An example of such analyses (a census of human genes coding for PLP-dependent enzymes) is provided here, whereas many more are accessible through the database itself.
The B6 database is a curated repository of biochemical and molecular information about an important group of enzymes. This information is logically organized and available for computational analyses, providing a key resource for the identification, classification and comparative analysis of B6-dependent enzymes.
The term 'vitamin B6' refers to a collective of six biologically interconvertible 3-hydroxy-2-methylpyridine compounds: pyridoxal, pyridoxine, pyridoxamine, and their respective 5'-phosphates. Among these, pyridoxal 5'-phosphate (PLP) is the main metabolically active form, serving as a cofactor for a variety of enzymes in all organisms [1–7].
Nearly all PLP-dependent enzymes, with the exception of glycogen phosphorylases, are associated with biochemical pathways involving amino compounds - mostly amino acids. The reactions catalyzed by the PLP-dependent enzymes that act on amino acids include transamination, decarboxylation, racemization, and eliminations or replacements at the β- or γ-carbons. Such versatility arises from the fact that PLP can covalently bind the substrate and then act as an electrophilic catalyst, stabilizing different types of carbanionic reaction intermediates (Figure 1).
The Enzyme Commission (EC; http://www.chem.qmul.ac.uk/iubmb/enzyme/) lists more than 140 PLP-dependent activities, corresponding to ~4% of all classified activities. Despite this wide functional variety, all structurally characterized PLP-dependent enzymes have been classified into just five distinct structural groups (also known as 'fold types') [4, 8], which presumably correspond to independent evolutionary lineages [3, 5]. This represents a remarkable example of divergent evolution, meaning that proteins with similar structure and sequence can perform different chemical reactions. Due to the mechanistic similarities between PLP-dependent enzymes and to their limited structural diversity, inferring the function of these catalysts solely based on sequence similarity entails particular difficulties.
To aid the identification and classification of sequences belonging to PLP-dependent enzymes, we have created the B6 database. In addition to a wealth of links to other Internet resources (including BRENDA and the PLP mutant enzyme database), the B6 database contains over 180 documented PLP-dependent activities that are associated, when possible, with one or more protein families (defined as monophyletic groups of homologous proteins sharing the same function). The database also contains hidden Markov models (HMMs) that were built from family alignments and that can be employed for the identification and functional classification of PLP-dependent enzymes in genomic sets of protein sequences. Indeed, we have used these HMMs to scan a series of complete genomes, obtaining a census of predicted PLP-dependent enzymes in various organisms.
Construction and content
Organization and statistics of the B6 database
Figure 2 summarizes the structure of the database, illustrating the types of information it includes and the ways in which this information is linked together and can be searched. As shown, the B6 database site actually accesses and integrates four distinct databases, namely a list of PLP-dependent activities, a collection of pertinent literature references, a large set of sequences of PLP-dependent proteins (grouped into protein families) and the results of our genomic searches.
The B6 database release 1.0 (as of 15/05/2009) includes 184 activities and over 2000 sequences of B6-dependent enzymes, subdivided into 149 families. For each family, the database provides a multiple sequence alignment and the derived hidden Markov model.
Assembly of the databases: activities, sequences and protein families
The B6 database was constructed based on an inventory of documented B6-dependent activities, most but not all of which have been catalogued by the Enzyme Commission and are therefore associated with an official EC number. A systematic examination of the literature showed that 121 of these activities could be associated with enzymes of known sequence, and in these cases we proceeded to create protein families, which we define as monophyletic groups of sequences all possessing the same enzymatic activity. Each activity was associated with one or more families based on this criterion.
The number of sequences in individual families was then increased by homology searches, i.e. by scanning GenBank with BLAST or with PSI-BLAST, using the functionally validated protein(s) as query. Criteria for inclusion of a sequence in a family were the following:
Only sequences yielding an E value < 10⁻¹⁰ were generally considered (this threshold could be somewhat relaxed for families composed of short sequences).
Sequences showing >90% identity to a protein of known function were usually not included, to reduce redundancy.
Sequences substantially (>30%) shorter than the shortest functionally validated sequence in the family were discarded. Sequences lacking the PLP-binding lysine residue were also discarded (except for rare cases in which the protein is known not to bind PLP via a lysine).
Sequences showing a higher similarity to other characterized PLP-dependent enzymes (i.e., to some functionally validated protein belonging to another family) were discarded.
Finally, sequences from taxa in which the enzymatic activity of the family was not documented were also generally discarded.
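The inclusion criteria listed above can be read as a chain of filters applied to each homology-search hit. The following Python fragment is only a minimal sketch of that reading: the Hit record, its field names and the passes_inclusion function are hypothetical and are not part of the B6 database software; the thresholds are the ones quoted in the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One homology-search hit (hypothetical record, for illustration only)."""
    e_value: float                # E value against the validated query
    identity: float               # fractional identity to a protein of known function
    length: int                   # length of the candidate sequence
    has_plp_lysine: bool          # PLP-binding lysine found at the expected position
    closer_to_other_family: bool  # more similar to a validated member of another family
    taxon_documented: bool        # activity documented in the candidate's taxon


def passes_inclusion(hit, shortest_validated_len, e_cutoff=1e-10,
                     max_identity=0.90, max_shortfall=0.30,
                     lysine_required=True):
    """Return True if the hit satisfies every criterion quoted in the text."""
    if hit.e_value >= e_cutoff:            # E value must be below the cut-off
        return False
    if hit.identity > max_identity:        # near-duplicates add only redundancy
        return False
    if hit.length < (1 - max_shortfall) * shortest_validated_len:
        return False                       # substantially truncated sequence
    if lysine_required and not hit.has_plp_lysine:
        return False                       # waived for families not binding PLP via lysine
    if hit.closer_to_other_family:
        return False
    return hit.taxon_documented


good = Hit(1e-40, 0.55, 390, True, False, True)
truncated = Hit(1e-40, 0.55, 200, True, False, True)
print(passes_inclusion(good, shortest_validated_len=400))       # True
print(passes_inclusion(truncated, shortest_validated_len=400))  # False
```

The keyword arguments make the exceptions mentioned in the text explicit: e_cutoff can be relaxed for families of short sequences, and lysine_required can be switched off for the rare proteins known not to bind PLP via a lysine.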
Multiple alignments were constructed with ClustalW. Given that the families were composed of closely related sequences, these alignments did not need to be manually adjusted or to be guided by structural information (even when available).
The ProDom program was used for alignment inspection and phylogenetic analysis. Family alignments were used to build hidden Markov models (HMMs) with programs of the HMMER suite. The scores of sequences included in or excluded from a given family were then calculated with respect to the family HMM. From this procedure, score cut-offs for each family were determined and then used for sequence classification.
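The text does not specify how a cut-off was derived from the scores of included and excluded sequences. One plausible reading, shown here purely as an assumption, is to place the cut-off in the gap between the lowest-scoring family member and the highest-scoring non-member, and to flag families where no such gap exists. The function name suggest_cutoff and the example scores are hypothetical.

```python
def suggest_cutoff(member_scores, nonmember_scores):
    """Suggest a score cut-off separating family members from non-members.

    Assumption (not stated in the article): the cut-off is the midpoint
    between the lowest member score and the highest non-member score.
    Returns None when the two score ranges overlap, so that a curator
    can inspect the family by hand.
    """
    lowest_member = min(member_scores)
    highest_nonmember = max(nonmember_scores)
    if highest_nonmember >= lowest_member:
        return None  # no clean separation between the two groups
    return (lowest_member + highest_nonmember) / 2


# Hypothetical HMM bit scores for one family
members = [410.2, 385.0, 512.7]
nonmembers = [120.4, 95.1]
print(suggest_cutoff(members, nonmembers))
```

A midpoint is only one possible choice; a curator might prefer a value closer to the lowest member score to reduce false positives.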
A family HMM is a probabilistic model, constructed from a multiple alignment, which describes the sequence conservation within a protein family. In comparison to consensus sequences or similar regular expressions, HMMs provide a more articulated modeling of the features of a protein family. Such higher complexity is responsible for the greater discriminatory power of the HMM methodology in the identification of other putative family members. Depending on family inclusion criteria and score thresholds, HMMs can be used to identify homology at different levels of granularity. The 'family' definition adopted in the B6 database is similar to the 'equivalog family' definition of TIGRFAM, while a single family in PFAM typically corresponds to many different families in our database.
Cluster analysis of PLP-dependent enzyme families
To elucidate the relationships between the 149 enzyme families defined as above, we performed an all-versus-all comparison of the families in the database using an HMM-HMM alignment software. The results of this comparison were analyzed with an interaction network software to build a homology-based network of PLP-dependent families (Figure 3). By considering only significant similarities (E < 10⁻⁵) between HMMs, the analysis identified seven separate clusters of PLP-dependent families (Figure 3). Five of these clusters corresponded to the traditional classification of PLP-dependent enzymes into five distinct structural groups (fold types I to V). Of the two additional clusters, one included lysine 5,6-aminomutase (EC: 18.104.22.168) and the other lysine 2,3-aminomutase (EC: 5.4.3.2) - two enzymes whose structures have been recently determined and found to be different from the known structures of PLP-dependent enzymes [20, 21]. In the database, the protein families belonging to these two clusters were assigned, respectively, to fold types VI and VII.
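Grouping families into clusters from pairwise similarities amounts to finding the connected components of a graph whose edges are the significant HMM-HMM hits. The sketch below illustrates this with a small union-find structure; it is a stand-in for the network software used in the study, and the family names and E values are invented for the example.

```python
def cluster_families(families, pairwise_evalues, e_cutoff=1e-5):
    """Group families into connected components of the similarity network.

    pairwise_evalues maps (family_a, family_b) to the E value of their
    HMM-HMM comparison; only pairs with E < e_cutoff become edges.
    """
    parent = {f: f for f in families}

    def find(x):
        # Follow parent links to the cluster root, compressing the path
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), e_value in pairwise_evalues.items():
        if e_value < e_cutoff:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for f in families:
        clusters.setdefault(find(f), set()).add(f)
    return sorted(sorted(members) for members in clusters.values())


# Toy data: five hypothetical families and their pairwise E values
families = ["A", "B", "C", "D", "E"]
evalues = {("A", "B"): 1e-20, ("B", "C"): 1e-8,
           ("C", "D"): 0.5, ("D", "E"): 1e-6}
print(cluster_families(families, evalues))
# [['A', 'B', 'C'], ['D', 'E']]
```

Note that membership is transitive: A and C end up in the same cluster through B even without a direct significant hit between them, which is how faint relationships can still tie a family to a fold type.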
Since HMM-HMM comparison is very sensitive to sequence similarity, it can reveal faint evolutionary relationships between protein families. This information can be particularly useful to identify relatives for PLP-dependent families that fail to reveal similarity with other families when analyzed by sequence-sequence (e.g., BLAST) or sequence-HMM (e.g., HMMPFAM) methods. The HMM-HMM analysis, for example, indicates a significant similarity between Prosc (a family of proteins with unknown function) and diaminopimelate decarboxylase (EC: 4.1.1.20) - a relationship that is not apparent through BLAST or HMMPFAM comparisons.
Inter-family distances derived from HMM-HMM comparisons served as a guide to build alignments representative of the seven distinct structural groups. Distance matrices among families were analyzed with a UPGMA algorithm, and a rapid multiple sequence alignment method was used to progressively align PLP-dependent families belonging to the same structural type. From these alignments, we constructed HMMs (hereafter named "fold-type HMMs") representative of the seven structural groups of PLP-dependent enzymes.
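UPGMA repeatedly merges the two closest clusters and recomputes distances to the merged cluster as size-weighted averages; the resulting merge order can serve as a guide for progressive alignment. The following is a compact, self-contained sketch of the textbook algorithm, not the implementation used for the database; the family names and distances are made up.

```python
from itertools import combinations


def upgma(names, dist):
    """Return the UPGMA merge order as (cluster_a, cluster_b, height) tuples.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    Merged clusters are represented as nested tuples.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves
    d = {frozenset(p): dist[frozenset(p)] for p in combinations(names, 2)}
    merges = []
    while len(sizes) > 1:
        closest = min(d, key=d.get)          # closest pair of clusters
        a, b = sorted(closest, key=str)      # deterministic ordering
        merged = (a, b)
        merges.append((a, b, d[closest] / 2))
        for other in sizes:
            if other == a or other == b:
                continue
            # Size-weighted average distance to the merged cluster
            d[frozenset((merged, other))] = (
                sizes[a] * d[frozenset((a, other))]
                + sizes[b] * d[frozenset((b, other))]
            ) / (sizes[a] + sizes[b])
        # Drop every distance that involves the two absorbed clusters
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return merges


# Hypothetical inter-family distances
distances = {frozenset(("A", "B")): 2.0,
             frozenset(("A", "C")): 6.0,
             frozenset(("B", "C")): 8.0}
for step in upgma(["A", "B", "C"], distances):
    print(step)
# ('A', 'B', 1.0)
# (('A', 'B'), 'C', 3.5)
```

In a progressive alignment, the families would be aligned in this order: A with B first, then the A-B alignment with C.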
Utility and discussion
The B6 database is a repository in which detailed (biochemical and genetic) information about an important group of enzymes is concentrated, organized and made available for computational analyses. We expect that the B6 database will be a valuable tool for experimental researchers in the PLP field, but also a reference point for the design of theoretical studies by bioinformaticians.
In particular, the sequence information accumulated in the database can be used to facilitate the identification and functional assignment of B6-dependent enzymes. To illustrate this point, we employed the family and fold-type HMMs (constructed as described above) to search and preliminarily classify PLP-dependent enzymes in genomic sets of predicted proteins. The results of such analyses have also been incorporated in the database.
Complete sets of protein sequences deduced from genomic data were generally obtained from NCBI (ftp://ftp.ncbi.nih.gov/genomes) or from similar ftp repositories. The classification of protein sequences was achieved through a two-step procedure. First, each sequence was compared with our database of PLP-dependent sequences by performing an HMM search with the seven fold-type HMMs, using relaxed significance criteria (E ≤ 10⁻¹; database size = 10000). This step served as a quick filter to sift out genes that were likely to code for PLP-dependent enzymes. Candidates were subsequently compared with the library of family HMMs using HMMPFAM. This step was more time-consuming and served for a preliminary functional classification of the proteins.
A protein was considered to possess the same activity as its best-hit family if it exhibited a significant similarity to the family HMM (E ≤ 10⁻³) and a score above a 'trusted' cut-off established by the family curator. Sequences with a score below this threshold were marked as 'low-score' to indicate their modest similarity to the family model. These sequences were not considered as possessing the enzymatic function of the family, but were regarded as possessing an uncharacterized, possibly related, activity. According to this analysis, very few sequences exhibited a significant similarity to a fold-type HMM (E ≤ 10⁻³) but no significant similarity to any family HMM. In such cases, sequences were considered as potential PLP-dependent enzymes with an uncharacterized catalytic activity.
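The two-step decision logic described above can be summarized as a small function. This is an illustrative sketch only: the function name, the label strings and the handling of weak fold-type hits that match no family (returned here as "unresolved") are assumptions, since the text does not say how such cases were treated.

```python
FOLD_FILTER_E = 1e-1   # relaxed threshold of the fold-type pre-filter
SIGNIFICANT_E = 1e-3   # threshold for a significant HMM match


def classify(fold_evalue, family_hit=None):
    """Classify one protein from its best fold-type and family HMM hits.

    fold_evalue: best E value against the seven fold-type HMMs.
    family_hit:  (family, e_value, score, trusted_cutoff) for the best-hit
                 family, or None if no family HMM matched.
    Returns a (label, family) pair.
    """
    if fold_evalue > FOLD_FILTER_E:
        return ("rejected", None)            # fails the quick filter
    if family_hit is not None:
        family, e_value, score, trusted_cutoff = family_hit
        if e_value <= SIGNIFICANT_E:
            if score > trusted_cutoff:
                return ("assigned", family)  # same activity as the family
            return ("low-score", family)     # possibly related activity
    if fold_evalue <= SIGNIFICANT_E:
        return ("uncharacterized PLP-dependent", None)
    return ("unresolved", None)              # assumption: not covered by the text


print(classify(1e-30, ("family X", 1e-50, 600.0, 350.0)))
print(classify(1e-30, ("family X", 1e-10, 200.0, 350.0)))
print(classify(1e-8))
```

Keeping the trusted cut-off per family, rather than using a single global score, reflects the fact that families differ in length and in internal divergence.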
To further characterize the protein sequence under examination, the classification program searched for a putative PLP-binding lysine residue (see legend of Figure 1). This was achieved by aligning the sequence with validated family members in which the position of the catalytic lysine had been previously mapped. This analysis can reveal proteins that are evolutionarily related to PLP-dependent enzymes, but have lost the ability to bind the PLP cofactor.
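Checking for the lysine reduces to locating the alignment column that holds the mapped lysine of a validated reference and reading the query residue in that column. The sketch below assumes a pairwise alignment given as two equal-length gapped strings; the function names and the toy sequences are hypothetical.

```python
def residue_at_reference_position(ref_aligned, query_aligned, ref_pos):
    """Return the query residue aligned to position ref_pos of the reference.

    ref_pos is 1-based and counts residues of the ungapped reference;
    gaps are written as '-'.
    """
    if len(ref_aligned) != len(query_aligned):
        raise ValueError("aligned sequences must have equal length")
    seen = 0
    for ref_res, query_res in zip(ref_aligned, query_aligned):
        if ref_res != "-":
            seen += 1
            if seen == ref_pos:
                return query_res
    raise ValueError("ref_pos lies beyond the end of the reference")


def has_plp_lysine(ref_aligned, query_aligned, lysine_pos):
    """True if the query carries a lysine at the mapped reference position."""
    return residue_at_reference_position(
        ref_aligned, query_aligned, lysine_pos) == "K"


# Toy alignment: the reference lysine is residue 3 of the ungapped reference
print(has_plp_lysine("MS-KAL", "MTAKGL", 3))  # True
print(has_plp_lysine("MS-KAL", "MTARGL", 3))  # False
```

A query with a gap or a different residue in that column would be flagged as a homolog that may have lost the ability to bind PLP.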
Example: a census of human genes that encode PLP-dependent enzymes
By employing the approach outlined above, we searched the latest draft of the human genome (NCBI 36 assembly, downloaded from ftp://ftp.ensembl.org/pub/) to obtain an inventory of the human genes coding for PLP-dependent enzymes. The initial output of the program (69 sequences recognized as probable PLP-dependent proteins) was further analyzed to identify pseudogenes, false positives and entries representing alternative protein isoforms.
The search identified 56 expressed genes coding for PLP-dependent proteins (Table 1). Note that the products of genes SPTLC1, ADC and AZIN1, albeit homologs of bona fide PLP-dependent enzymes, appear to have acquired a nonenzymic function during evolution. Thirteen more proteins were recognized as isoforms deriving from some of the genes above. To estimate the rate of false negatives in our analysis, we performed an extensive text search in the GenBank database of human genes, to identify all those genes annotated (directly or indirectly) as coding for B6-dependent proteins. We found no hits other than the 56 genes listed in Table 1, which therefore represent, to the best of our current knowledge, the complement of human PLP-dependent genes.
We also compared the functional classification provided by the B6 database with the manual annotation included in the NCBI 36 release of the human genome, finding no significant differences. This implies that the accuracy of our automatic classification system can match that of a manual expert annotation. It should be noted that only a minority of complete genomes have been subjected to accurate manual annotation. In genomes where proteins have been mostly annotated through a general system of automatic annotation, our specialized tool provides a more complete and accurate classification of PLP-dependent enzymes.
Of course, accuracy in the annotation of a gene product does not always guarantee a precise functional assignment, as can be gleaned by inspecting Table 1. For example, some of the human PLP-dependent proteins in our inventory are homologs of enzymes (such as plant ACC synthases or bacterial threonine synthases) that are not expected to occur in mammals. In other cases, the proteins are homologs of other (functionally validated) human enzymes, but it is unclear whether they represent true isozymic forms, or rather possess distinct catalytic activities - this latter possibility may be especially pertinent for those sequences that were recognized as 'low-score' by our search procedure. These uncharacterized gene products therefore represent interesting subjects for functional genomic studies.
Some genes encoding PLP-dependent enzymes may be missing from the list, possibly due to the limits of the current human genome assembly, even eight years after publication of the first genome draft. For example, the gene ACCSL has been recognized as protein-coding only in the NCBI 36 assembly but was absent in the preceding version (NCBI 35).
The increasing number of predicted protein sequences generated by genomic sequencing projects requires methods to predict the details of their functions. The B6 database allows the comparison of newly sequenced PLP-dependent proteins with a curated collection of protein families, making preliminary functional classification more reliable and also helping to pinpoint the gene products that are the most interesting candidates for functional studies.
Owing to progress in functional genomics, as well as to classical biochemical and genetic approaches, the body of information on PLP-dependent enzymes is bound to increase. Many activities that are currently 'orphan' (i.e., with no molecular details about the responsible enzymes) will be associated with specific sequences, while many new activities are likely to be discovered. Accordingly, we expect to periodically update and expand the B6 database with the ensuing information, to keep it a serviceable tool and a reference point for the scientific community.
Availability and requirements
The B6 database, which is based on the web-oriented Perl package Woda, is publicly available over the Internet http://bioinformatics.unipr.it/B6db. Users are asked to cite the present article.
HMM: hidden Markov model
John RA: Pyridoxal phosphate-dependent enzymes. Biochim Biophys Acta 1995, 1248: 81–96.
Jansonius JN: Structure, evolution and action of vitamin B6-dependent enzymes. Curr Opin Struct Biol 1998, 8: 759–769. 10.1016/S0959-440X(98)80096-1
Mehta PK, Christen P: The molecular evolution of pyridoxal-5'-phosphate-dependent enzymes. Adv Enzymol 2000, 74: 129–184.
Schneider G, Kack H, Lindqvist Y: The manifold of vitamin B6 dependent enzymes. Structure Fold Des 2000, 8: R1–6. 10.1016/S0969-2126(00)00085-X
Christen P, Mehta PK: From cofactor to enzymes. The molecular evolution of pyridoxal-5'-phosphate-dependent enzymes. Chem Rec 2001, 1: 436–447. 10.1002/tcr.10005
Percudani R, Peracchi A: A genomic overview of pyridoxal-phosphate-dependent enzymes. EMBO Rep 2003, 4: 850–854. 10.1038/sj.embor.embor914
Toney MD: Reaction specificity in pyridoxal phosphate enzymes. Arch Biochem Biophys 2005, 433: 279–287. 10.1016/j.abb.2004.09.037
Grishin NV, Phillips MA, Goldsmith EJ: Modeling of the spatial structure of eukaryotic ornithine decarboxylases. Protein Sci 1995, 4: 1291–1304. 10.1002/pro.5560040705
Schomburg I, Chang A, Hofmann O, Ebeling C, Ehrentreich F, Schomburg D: BRENDA: a resource for enzyme data and metabolic information. Trends Biochem Sci 2002, 27: 54–56. 10.1016/S0968-0004(01)02027-8
Di Giovine P: PLPMDB: pyridoxal-5'-phosphate dependent enzymes mutants database. Bioinformatics 2004, 20: 3652–3653. 10.1093/bioinformatics/bth399
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410.
Altschul SF, Koonin EV: Iterated profile searches with PSI-BLAST--a tool for discovery in protein databases. Trends Biochem Sci 1998, 23: 444–447. 10.1016/S0968-0004(98)01298-5
Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22: 4673–4680. 10.1093/nar/22.22.4673
Corpet F, Servant F, Gouzy J, Kahn D: ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res 2000, 28: 267–269. 10.1093/nar/28.1.267
Eddy SR: Profile hidden Markov models. Bioinformatics 1998, 14: 755–763. 10.1093/bioinformatics/14.9.755
Selengut JD, Haft DH, Davidsen T, Ganapathy A, Gwinn-Giglio M, Nelson WC, Richter AR, White O: TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes. Nucleic Acids Res 2007, 35: D260–264. 10.1093/nar/gkl1043
Sammut SJ, Finn RD, Bateman A: Pfam 10 years on: 10,000 families and still growing. Brief Bioinform 2008, 9: 210–219. 10.1093/bib/bbn010
Soding J: Protein homology detection by HMM-HMM comparison. Bioinformatics 2005, 21: 951–960. 10.1093/bioinformatics/bti125
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003, 13: 2498–2504. 10.1101/gr.1239303
Berkovitch F, Behshad E, Tang KH, Enns EA, Frey PA, Drennan CL: A locking mechanism preventing radical damage in the absence of substrate, as revealed by the x-ray structure of lysine 5,6-aminomutase. Proc Natl Acad Sci USA 2004, 101: 15870–15875. 10.1073/pnas.0407074101
Lepore BW, Ruzicka FJ, Frey PA, Ringe D: The x-ray crystal structure of lysine-2,3-aminomutase from Clostridium subterminale. Proc Natl Acad Sci USA 2005, 102: 13819–13824. 10.1073/pnas.0505726102
Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 2002, 30: 3059–3066. 10.1093/nar/gkf436
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al.: Initial sequencing and analysis of the human genome. Nature 2001, 409: 860–921. 10.1038/35057062
Hanada K: Serine palmitoyltransferase, a key enzyme of sphingolipid metabolism. Biochim Biophys Acta 2003, 1632: 16–30.
Donini S, Ferrari M, Fedeli C, Faini M, Lamberto I, Marletta AS, Mellini L, Panini M, Percudani R, Pollegioni L, et al.: Recombinant production of eight human cytosolic aminotransferases and assessment of their potential involvement in glyoxylate metabolism. Biochem J 2009, 422: 265–272. 10.1042/BJ20090748
Donini S, Percudani R, Credali A, Montanini B, Sartori A, Peracchi A: A threonine synthase homolog from a mammalian genome. Biochem Biophys Res Commun 2006, 350: 922–928. 10.1016/j.bbrc.2006.09.112
Koch KA, Capitani G, Gruetter MG, Kirsch JF: The human cDNA for a homologue of the plant enzyme 1-aminocyclopropane-1-carboxylate synthase encodes a protein lacking that activity. Gene 2001, 272: 75–84. 10.1016/S0378-1119(01)00533-9
Kanerva K, Makitie LT, Pelander A, Heiskala M, Andersson LC: Human ornithine decarboxylase paralogue (ODCp) is an antizyme inhibitor but not an arginine decarboxylase. Biochem J 2008, 409: 187–192. 10.1042/BJ20071004
Koguchi K, Kobayashi S, Hayashi T, Matsufuji S, Murakami Y, Hayashi S: Cloning and sequencing of a human cDNA encoding ornithine decarboxylase antizyme inhibitor. Biochim Biophys Acta 1997, 1353: 209–216.
This work was supported by the Italian Ministry of Education, University and Research (COFIN 2005 and 2007). The authors thank Andrea Mozzarelli for support, and Francesca Ravasini, Francesco Gandolfi and Daniela Lazzaretti for technical assistance.
RP designed and implemented the B6 database and website, carried out the genomic analysis and revised the manuscript. AP collected the literature included in the database, selected the functionally validated sequences, helped to build the families, drafted and revised the manuscript. Both authors read and approved the final manuscript.