Extension of the COG and arCOG databases by amino acid and nucleotide sequences
© Meereis and Kaufmann; licensee BioMed Central Ltd. 2008
Received: 03 June 2008
Accepted: 13 November 2008
Published: 13 November 2008
The current versions of the COG and arCOG databases, both excellent frameworks for studies in comparative and functional genomics, do not contain the nucleotide sequences corresponding to their protein or protein domain entries.
Using sequence information obtained from GenBank flat files covering the completely sequenced genomes of the COG and arCOG databases, we constructed NUCOCOG (nucleotide sequences containing COG databases) as an extended version including all nucleotide sequences and in addition the amino acid sequences originally utilized to construct the current COG and arCOG databases. We make available three comprehensive single XML files containing the complete databases including all sequence information. In addition, we provide a web interface as a utility suitable to browse the NUCOCOG database for sequence retrieval. The database is accessible at http://www.uni-wh.de/nucocog.
NUCOCOG makes it possible to analyze any sequence-related property within the COG and arCOG framework simply by applying scripting languages such as PERL to a single, albeit large, XML document.
Construction and content
Construction of the NUCOCOG database
Table 1: Content of the three NUCOCOG databases. Row labels recoverable from the table: ambiguous amino acids (a. a. a.) B, U, X, Z; ambiguous nucleotides (a. n.) b, d, h, k, m, n, r, s, v, w, y.
Construction of the arNUCOCOG database
The arNUCOCOG database (arnucocog.xml) was constructed essentially as described above, using the information from ar40.fa and arCOG.csv to build the initial XML file. The current version of arCOGs includes the genome of Thermoproteus tenax, which had not been published at the time of its release; at the request of the sequencing consortium, those proteins were removed from the ar40.fa file and are therefore also absent from the arNUCOCOG database. In addition, 34 sequences listed in arCOG.csv could not be located in ar40.fa. For various reasons these proteins had not been translated, and the arCOG authors detected them by tBLASTn, using an orthologous sequence from a close relative as the query (Kira Makarova, personal communication). We included those sequences manually by reproducing that search. Searching the GBK files for matching amino acid sequences recovered 99.8 % of all nucleotide sequences. The remaining sequences were detected by the alternative methods described above, and only three sequences had to be searched for manually. Many of the arCOGs are new and consequently not assigned to a classical COG-number. In all those cases, we included "NO_COG" between the respective tags. Because arCOG.csv uses protein gi-numbers as domain-ids, the domains of a split sequence share the same id and are not uniquely identified. We resolved this by appending consecutively numbered suffixes, separated by an underscore, to those gi-numbers, e.g. <DOMAINNAME>118430839_1</DOMAINNAME>.
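The suffix scheme for split sequences can be illustrated with a minimal Python sketch. It is not the authors' code: the function name assign_domain_ids is hypothetical, and the sketch assumes that only gi-numbers occurring more than once receive a suffix and that numbering follows the order of appearance.

```python
from collections import Counter


def assign_domain_ids(gi_numbers):
    """Return one unique domain-id per input gi-number.

    Assumption: a gi-number listed more than once represents a split
    sequence; its occurrences get the suffixes _1, _2, ... in input order.
    gi-numbers that occur only once are returned unchanged.
    """
    totals = Counter(gi_numbers)   # how often each gi-number occurs overall
    seen = Counter()               # how often each gi-number was seen so far
    domain_ids = []
    for gi in gi_numbers:
        if totals[gi] > 1:
            seen[gi] += 1
            domain_ids.append(f"{gi}_{seen[gi]}")
        else:
            domain_ids.append(gi)
    return domain_ids
```

Called with the gi-number from the example above listed twice, the function yields 118430839_1 and 118430839_2, so every domain of the split protein can be addressed individually.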
Combining NUCOCOG with arNUCOCOG (NUCOCOG_2)
We also combined NUCOCOG and arNUCOCOG, resulting in nucocog_2.xml. For that purpose, we removed all sequences of the 13 archaeal genomes from NUCOCOG and included all data from arNUCOCOG instead. In addition, we added those sequences from ar40.fa that, according to arCOG.csv, are assigned to classical COGs but are not part of any arCOG. For these amino acid sequences, the corresponding nucleotide sequences were included as described above and "NO_COG" was written between the arCOG-tags.
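The merge logic of this step can be sketched in Python as follows. The sketch only illustrates the three operations described above; the record layout (dictionaries with the keys "genome" and "arcog") and the function name combine_databases are assumptions made for this example and do not reflect the actual file structure.

```python
def combine_databases(nucocog, arnucocog, cog_only_archaeal, archaeal_genomes):
    """Combine NUCOCOG and arNUCOCOG records into one list.

    nucocog, arnucocog  -- lists of entry dicts (hypothetical layout)
    cog_only_archaeal   -- archaeal entries assigned to a classical COG
                           but to no arCOG
    archaeal_genomes    -- set of genome identifiers to drop from NUCOCOG
    """
    # 1. Keep only the non-archaeal entries of NUCOCOG.
    combined = [e for e in nucocog if e["genome"] not in archaeal_genomes]
    # 2. Take over all entries of arNUCOCOG.
    combined.extend(arnucocog)
    # 3. Add COG-only archaeal entries, marking the missing arCOG.
    combined.extend({**e, "arcog": "NO_COG"} for e in cog_only_archaeal)
    return combined
```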
Content of the NUCOCOG database
The content of the three database files is summarized in Table 1. As can be seen, some nucleotide sequences contain stop codons within coding regions, and there are both ambiguous amino acids (a. a. a.) and ambiguous nucleotides (a. n.). In those cases, an unambiguous translation of a codon into an amino acid is impossible. For that reason, we did not delete the amino acid sequences from our files after the databases had been constructed, although this results in larger database files containing redundancies.
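Entries affected by these problems can be located with a short check such as the following Python sketch. It assumes that a sequence is given as a plain string starting in frame, uses the ambiguous nucleotide symbols listed in Table 1, and takes the stop codons of the standard genetic code (taa, tag, tga); the function name codon_issues is hypothetical.

```python
AMBIGUOUS_NT = set("bdhkmnrsvwy")      # ambiguity symbols listed in Table 1
STOP_CODONS = {"taa", "tag", "tga"}    # assumption: standard genetic code


def codon_issues(nuc_seq):
    """Return (ambiguous, internal_stops) as lists of codon indices.

    ambiguous       -- codons containing at least one ambiguity symbol
    internal_stops  -- stop codons anywhere except the final codon
    Trailing bases that do not fill a complete codon are ignored.
    """
    seq = nuc_seq.lower()
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    ambiguous = [i for i, c in enumerate(codons) if AMBIGUOUS_NT & set(c)]
    internal_stops = [i for i, c in enumerate(codons[:-1]) if c in STOP_CODONS]
    return ambiguous, internal_stops
```

For the toy sequence atgnnntaaggctga, the second codon is reported as ambiguous and the third as an internal stop, while the terminal tga is not flagged.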
Most users will probably use the set of databases for the purpose we primarily constructed it for, i. e. by downloading the XML files and analyzing them with respect to their own research questions using their individually developed software tools. Nevertheless, we also provide a web-based utility to browse the databases for sequence retrieval by COG-number, arCOG-number, protein name, and GI-number. For that purpose, we used the Apache HTTP Server and an SQL backend. The XML files were converted to tables of an SQL database: one for all COG data and one each for the nucleotide and amino acid sequences. The names of the proteins or protein domains were used as unique keys. Queries are made through a frontend written in PHP, which offers the option to select certain entries and display their corresponding amino acid and nucleotide sequences.
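The table layout described above can be approximated with the following Python sketch based on the built-in sqlite3 module. It is an illustration only: the actual backend, table names, and column names are not specified here, so cog_data, nuc_seq, aa_seq, and all column names are assumptions. The sketch keeps the one stated design decision, namely that the protein or domain name serves as the unique key in every table.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with three hypothetical tables."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE cog_data (domain_name TEXT PRIMARY KEY,
                               cog TEXT, arcog TEXT, gi TEXT);
        CREATE TABLE nuc_seq  (domain_name TEXT PRIMARY KEY, sequence TEXT);
        CREATE TABLE aa_seq   (domain_name TEXT PRIMARY KEY, sequence TEXT);
    """)
    return con


def add_entry(con, domain_name, cog, arcog, gi, aa, nuc):
    """Insert one entry into all three tables under the same key."""
    con.execute("INSERT INTO cog_data VALUES (?, ?, ?, ?)",
                (domain_name, cog, arcog, gi))
    con.execute("INSERT INTO aa_seq VALUES (?, ?)", (domain_name, aa))
    con.execute("INSERT INTO nuc_seq VALUES (?, ?)", (domain_name, nuc))


def sequences_for_cog(con, cog):
    """Return (domain_name, amino acid seq, nucleotide seq) for one COG."""
    return con.execute(
        """SELECT c.domain_name, a.sequence, n.sequence
           FROM cog_data AS c
           JOIN aa_seq  AS a ON a.domain_name = c.domain_name
           JOIN nuc_seq AS n ON n.domain_name = c.domain_name
           WHERE c.cog = ?
           ORDER BY c.domain_name""",
        (cog,),
    ).fetchall()
```

A retrieval by arCOG-number or GI-number would follow the same pattern with a different WHERE clause.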
The COG and arCOG databases represent excellent collections of proteins (or protein domains). The version presented here, which includes amino acid and nucleotide sequences, makes it possible to address sequence-related questions with respect to orthologous proteins, i. e. proteins that are assumed to exhibit identical functions. For instance, one may ask whether enzymes involved in a certain metabolic pathway are subject to constraints in their amino acid composition. This has been described for enzymes involved in tryptophan biosynthesis, since the 5 protein chains of the E. coli trp operon contain only 5 tryptophan residues. Indeed, the "cognate bias hypothesis", which states that early in evolutionary history the biosynthetic enzymes for amino acid X gradually lost residues of X, could elegantly be tested using the NUCOCOG files presented here. Questions related to deviations in nucleotide sequence composition, such as codon usage or GC-content, depending on the functions of the respective proteins could also be answered by exploring the XML files provided here. Furthermore, the COG framework has proved to be a powerful tool in conjunction with phylogenetic protein sequence distributions [16, 17]. The possibility to examine clade-specific features of nucleotide or amino acid sequences within the COG context could also yield more precise data than a simple comparison of whole-genome sequences. For example, several studies deal with differences in sequence-specific properties between (hyper)thermophiles and mesophiles by comparing the sequence data of their complete genomes [18–20]. Those surveys do not account for possible differences in sequence signatures that depend solely on the function of the respective protein rather than on the phylogenetic relationship of the organisms under investigation.
To refine such studies, proteins that are derived from different organisms but exhibit identical biochemical functions should be compared on a large scale, rather than complete genomes. We constructed NUCOCOG with that intention, and our future work will address exactly this refinement: detecting thermophile-specific sequence signatures while accounting for possible distortions caused by comparing functionally different proteins.
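Analyses such as the GC-content or tryptophan content per COG mentioned above could be scripted along the lines of the following Python sketch. The only tag name known from the text is DOMAINNAME; the tag names ENTRY, COG, NUCSEQ, and AASEQ as well as the function name per_cog_stats are assumptions and must be adapted to the actual XML files. The file is read incrementally with iterparse so that a large document does not have to be held in memory as a whole.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def per_cog_stats(xml_source, entry_tag="ENTRY", cog_tag="COG",
                  nuc_tag="NUCSEQ", aa_tag="AASEQ"):
    """Return GC and tryptophan fractions per COG.

    xml_source is a file name or a binary file object. All tag names are
    hypothetical defaults. Ambiguous nucleotides (e.g. s) are counted in
    the sequence length but not as g or c.
    """
    totals = defaultdict(lambda: {"gc": 0, "nt": 0, "trp": 0, "aa": 0})
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != entry_tag:
            continue
        cog = (elem.findtext(cog_tag) or "NO_COG").strip()
        nuc = (elem.findtext(nuc_tag) or "").strip().lower()
        aa = (elem.findtext(aa_tag) or "").strip().upper()
        t = totals[cog]
        t["gc"] += nuc.count("g") + nuc.count("c")
        t["nt"] += len(nuc)
        t["trp"] += aa.count("W")
        t["aa"] += len(aa)
        elem.clear()  # release the processed entry to limit memory use
    return {
        cog: {
            "gc_fraction": t["gc"] / t["nt"] if t["nt"] else None,
            "trp_fraction": t["trp"] / t["aa"] if t["aa"] else None,
        }
        for cog, t in totals.items()
    }
```

Restricting the loop to entries from selected organisms, for example thermophiles versus mesophiles, would give the function-aware comparison outlined above.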
NUCOCOG is a version of the current COG and arCOG databases assembled in single XML files that contain both the amino acid and the nucleotide sequences associated with their respective entries. In-depth analysis of these XML files makes it possible to investigate any sequence-specific property in the COG context, taking into account functional and phylogenetic relationships.
Availability and requirements
Abbreviations
ASCII: American standard code for information interchange
BLAST: basic local alignment search tool
COG: cluster of orthologous groups
FTP: file transfer protocol
GBK: GenBank (file-extension .gbk)
HTTP: hypertext transfer protocol
NCBI: National Center for Biotechnology Information
NUCOCOG: nucleotide sequences containing COG
PERL: practical extraction and report language
PHP: PHP hypertext pre-processor
SQL: structured query language
URL: uniform resource locator
XML: extensible markup language
We thank Daniela Kaufmann for her help in editing the manuscript and the staff of the Bereich für Informationstechnologie at Witten/Herdecke University for their support in implementing NUCOCOG on our web servers.
- Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on protein families. Science (New York, NY) 1997, 278(5338):631–637.
- Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al.: The COG database: an updated version includes eukaryotes. BMC bioinformatics 2003, 4: 41. doi:10.1186/1471-2105-4-41
- Tatusov RL, Galperin MY, Natale DA, Koonin EV: The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic acids research 2000, 28(1):33–36. doi:10.1093/nar/28.1.33
- Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic acids research 2001, 29(1):22–28. doi:10.1093/nar/29.1.22
- Kaufmann M: The Role of the COG Database in Comparative and Functional Genomics. Current Bioinformatics 2006, 1(3):291–300. doi:10.2174/157489306777828017
- Makarova KS, Sorokin AV, Novichkov PS, Wolf YI, Koonin EV: Clusters of orthologous genes for 41 archaeal genomes and implications for evolutionary genomics of archaea. Biology direct 2007, 2: 33. doi:10.1186/1745-6150-2-33
- Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, Rajandream MA, Barrell B: Artemis: sequence visualization and annotation. Bioinformatics (Oxford, England) 2000, 16(10):944–945. doi:10.1093/bioinformatics/16.10.944
- Informatics Software: Artemis [http://www.sanger.ac.uk/Software/Artemis/]
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of molecular biology 1990, 215(3):403–410.
- Zalkin H, Yanofsky C: Yeast gene TRP5: structure, function, regulation. The Journal of biological chemistry 1982, 257(3):1491–1500.
- Perlstein EO, de Bivort BL, Kunes S, Schreiber SL: Evolutionarily conserved optimization of amino acid biosynthesis. Journal of molecular evolution 2007, 65(2):186–196. doi:10.1007/s00239-007-0013-x
- Reichard K, Kaufmann M: EPPS: mining the COG database by an extended phylogenetic patterns search. Bioinformatics (Oxford, England) 2003, 19(6):784–785. doi:10.1093/bioinformatics/btg089
- Meereis F, Kaufmann M: PCOGR: phylogenetic COG ranking as an online tool to judge the specificity of COGs with respect to freely definable groups of organisms. BMC bioinformatics 2004, 5: 150. doi:10.1186/1471-2105-5-150
- Cambillau C, Claverie JM: Structural and genomic correlates of hyperthermostability. The Journal of biological chemistry 2000, 275(42):32383–32386. doi:10.1074/jbc.C000497200
- Suhre K, Claverie JM: Genomic correlates of hyperthermostability, an update. The Journal of biological chemistry 2003, 278(19):17198–17202. doi:10.1074/jbc.M301327200
- Singer GA, Hickey DA: Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content. Gene 2003, 317(1–2):39–47. doi:10.1016/S0378-1119(03)00660-7
- The Protein Chemistry Group: NUCOCOG online [http://www.uni-wh.de/nucocog/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.