ANCAC: amino acid, nucleotide, and codon analysis of COGs – a tool for sequence bias analysis in microbial orthologs
© Meiler et al.; licensee BioMed Central Ltd. 2012
Received: 14 May 2012
Accepted: 6 September 2012
Published: 8 September 2012
The COG database is the most popular collection of orthologous proteins from many different completely sequenced microbial genomes. By definition, a cluster of orthologous groups (COG) within this database exclusively contains proteins that most likely perform the same cellular function. Recently, the COG database was extended by assigning to every protein both its amino acid sequence and its encoding nucleotide sequence, resulting in the NUCOCOG database. This extended version of the COG database is a valuable resource connecting sequence features with the functionality of the respective proteins.
Here we present ANCAC, a web tool and MySQL database for the analysis of amino acid, nucleotide, and codon frequencies in COGs on the basis of freely definable phylogenetic patterns. We demonstrate the usefulness of ANCAC by analyzing amino acid frequencies, codon usage, and GC-content in a species- or function-specific context. With respect to amino acids we, at least in part, confirm the cognate bias hypothesis by using ANCAC’s NUCOCOG dataset as the largest one available for that purpose thus far.
Using the NUCOCOG datasets, ANCAC connects taxonomic, amino acid, and nucleotide sequence information with the functional classification via COGs and provides a GUI for flexible mining for sequence bias. To our knowledge, it is thereby the only tool for the analysis of sequence composition in the light of physiological roles and phylogenetic context that does not require substantial programming skills.
Within the COG database, orthologous protein sequences, i.e. sequences assumed to perform the same biochemical function, are assigned to individual clusters [1–4]. Although more than a thousand microbial genomes have been completely sequenced by now, the current COG database, which contains only 66 representatives, still plays an important role in genomics, especially when functional aspects of proteins are taken into consideration. In addition to the classical COG, the arCOG database has been constructed, representing a refinement with respect to 41 archaeal genomes. Both databases originally contained no sequence information directly connected with protein names. This drawback was recently eliminated by the construction of NUCOCOG, the nucleotide-sequence-containing COG database. Thus, NUCOCOG allows linking sequence signatures of both amino acid and nucleotide sequences with functional aspects of the corresponding proteins. The database consists of three XML files: nucocog.xml, containing the classical COG database; arnucocog.xml, containing the arCOG database; and nucocog2.xml, containing a non-redundant combination of both the classical COG and arCOG databases.
This report describes ANCAC, a web tool capable of mining the NUCOCOG database with respect to the frequencies of amino acids, nucleotides, and codons within COG sequences. The user selects i) the database, ii) the type of sequence feature to be analyzed (amino acid, nucleotide, or codon), iii) a sequence feature or a set of sequence features of that type, and iv) the set of organisms whose sequences are to be considered. The web interface then calculates the absolute and relative abundances of the selected feature(s) in the sequences of each COG and returns a ranking with respect to the calculated abundance indexes.
To demonstrate the usefulness of ANCAC, we confirmed earlier findings about correlations between sequence composition and growth temperature or oxygen metabolism, as well as parts of the “cognate bias hypothesis”, which states that early in evolutionary history the biosynthetic enzymes for amino acid x gradually lost residues of x, thereby reducing the threshold for deleterious effects of x scarcity.
The user starts here by specifying which database, i.e. COG, arCOG, or both, should be the basis of the analysis. Depending on the chosen database, a selection of available organisms is displayed. Patterns of organisms to be analyzed can then be specified either by manually selecting particular microorganisms of interest or by making use of one of three input aids provided by the tool. With these aids, groups of organisms can be selected i) by taxon on any level using a hierarchical taxonomic tree, ii) by traits or environmental conditions using pre-defined patterns, or iii) by the occurrence of organisms in a COG of choice. Selections by several criteria can be made consecutively in order to form a superset of organisms, i.e. by logical AND operations.
In this tab, the type of sequence feature is selected, i. e. amino acid, nucleotide or codon. Subsequently, any set of features of that type can be defined.
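As a rough illustration of what choosing a feature type and a feature set amounts to, the following Python sketch counts the chosen features in a single sequence. The function name `count_features`, the in-frame splitting of nucleotide sequences into codons, and the choice of denominator (residues, bases, or whole codons) are assumptions made for illustration; they are not taken from the ANCAC implementation.

```python
from collections import Counter


def count_features(sequence, feature_type, features):
    """Return (hits, total_units) for the chosen features in one sequence.

    feature_type is "amino acid", "nucleotide", or "codon" (hypothetical labels).
    features is a set of single letters or, for codons, of triplets.
    """
    seq = sequence.upper()
    if feature_type == "codon":
        # Assumption: read in frame from position 0 and drop a trailing
        # partial codon, so the unit of counting is a whole triplet.
        usable = len(seq) - len(seq) % 3
        units = [seq[i:i + 3] for i in range(0, usable, 3)]
    elif feature_type in ("amino acid", "nucleotide"):
        units = list(seq)
    else:
        raise ValueError(f"unknown feature type: {feature_type!r}")

    counts = Counter(units)
    wanted = {f.upper() for f in features}
    hits = sum(counts[f] for f in wanted)
    return hits, len(units)


# Toy inputs, not taken from NUCOCOG:
print(count_features("MKRHK", "amino acid", {"K", "H", "R"}))
print(count_features("ATGAAACGT", "codon", {"AAA"}))
```

Returning the hit count together with the number of units keeps the per-sequence result reusable: percentages for any group of sequences can later be formed from sums of these two numbers.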
“Normalize against all sequences in the selected database” determines a frequency score by dividing the APSF (the average percentage of the sequence features selected) by the average percentage of the selected features calculated for all sequences within the whole selected database.
“Normalize against the sequences from selected organisms only” determines the frequency score by dividing the APSF by the average percentage of the selected features calculated for the sequences within the selected organisms only.
By default, one score is calculated for each COG using the sequences of all selected organisms in this COG. In order to add taxonomic resolution, “subgroup the organism-selection by taxonomic rank” can be chosen, which assigns sequences to groups at the taxonomic level of choice. Thus, one COG score will be calculated for each taxon occurring in the organism selection at the specified level.
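The subgrouping by taxonomic rank can be pictured as changing the grouping key from the COG alone to a (COG, taxon) pair. The sketch below is a minimal illustration under assumed data structures: the row layout, the `lineage` dictionary, and the function name `percentages_by_taxon` are hypothetical and do not reflect ANCAC's database schema.

```python
from collections import defaultdict


def percentages_by_taxon(rows, lineage, rank):
    """Pooled feature percentage per (COG, taxon) pair.

    rows:    iterable of (cog, organism, feature_count, length) tuples
    lineage: dict mapping organism -> {rank name: taxon name}
    rank:    taxonomic level to group by, e.g. "phylum"
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [feature count, length]
    for cog, organism, count, length in rows:
        # Organisms lacking the requested rank are pooled under one label.
        taxon = lineage[organism].get(rank, "unclassified")
        totals[(cog, taxon)][0] += count
        totals[(cog, taxon)][1] += length
    return {
        key: 100.0 * count / length
        for key, (count, length) in totals.items()
        if length > 0
    }


# Toy data with hypothetical organisms orgA, orgB, and orgC:
lineage = {
    "orgA": {"phylum": "Proteobacteria"},
    "orgB": {"phylum": "Proteobacteria"},
    "orgC": {"phylum": "Firmicutes"},
}
rows = [
    ("COG0001", "orgA", 10, 100),
    ("COG0001", "orgB", 30, 100),
    ("COG0001", "orgC", 5, 50),
]
by_phylum = percentages_by_taxon(rows, lineage, "phylum")
```

With this toy input, the single COG yields two values, one per phylum present in the selection, instead of one value for the whole selection.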
A different mode of analysis from what has been described above is provided by “Batch processing by groups of COGs and sequence-features”. Here, the user can precisely define queries by text input obeying simple formatting rules. This allows calculating scores for any combination of sequence features within selected sets of COGs. Most importantly, cumulative scores for user-defined groups of COGs allow the detection of sequence bias in arbitrary biological contexts such as biochemical pathways or cellular location.
The server calculates a score for the sequences of each COG or optionally each group of COGs in case of batch processing. Only sequences derived from the organisms that have been selected are considered for computation. The APSFs are calculated by summing up pre-calculated feature counts and sequence lengths which are stored in the database for each sequence separately. To interpret sequence bias, the frequencies obtained are optionally normalized as described above.
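The computation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the server's code: the record type `SequenceRow`, the function names, and the reading of the reference percentages as pooled sums over the database or over the selected organisms are all hypothetical. Only the idea of summing pre-calculated counts and lengths comes from the description.

```python
from dataclasses import dataclass


@dataclass
class SequenceRow:
    cog: str
    organism: str
    feature_count: int  # pre-calculated hits for the selected features
    length: int         # sequence length in the same units


def apsf(rows):
    """Pooled percentage: 100 * summed feature counts / summed lengths."""
    total_length = sum(r.length for r in rows)
    if total_length == 0:
        raise ValueError("no sequence data to compute a percentage from")
    return 100.0 * sum(r.feature_count for r in rows) / total_length


def cog_scores(rows, selected_organisms, normalize="none"):
    """Score each COG using only sequences from the selected organisms.

    normalize: "none", "database" (all rows), or "selection" (selected rows).
    Returns a dict ordered from highest to lowest score.
    """
    selected = [r for r in rows if r.organism in selected_organisms]
    if normalize == "database":
        reference = apsf(rows)
    elif normalize == "selection":
        reference = apsf(selected)
    elif normalize == "none":
        reference = None
    else:
        raise ValueError(f"unknown normalization: {normalize!r}")
    if reference == 0:
        raise ValueError("reference percentage is zero; cannot normalize")

    by_cog = {}
    for r in selected:
        by_cog.setdefault(r.cog, []).append(r)

    scores = {}
    for cog, members in by_cog.items():
        value = apsf(members)
        scores[cog] = value if reference is None else value / reference
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


# Toy rows with hypothetical organisms; the numbers are made up.
rows = [
    SequenceRow("COG0001", "orgA", 10, 100),
    SequenceRow("COG0001", "orgB", 30, 100),
    SequenceRow("COG0002", "orgA", 5, 100),
    SequenceRow("COG0002", "orgC", 50, 100),
]
raw = cog_scores(rows, {"orgA", "orgB"})
relative = cog_scores(rows, {"orgA", "orgB"}, normalize="selection")
```

Under this reading, a normalized score above 1 marks a COG that is enriched in the selected features relative to the reference set, and a score below 1 marks a depleted one.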
Results and discussion
ANCAC is a tool for analyzing the sequences in COGs in a functional and phylogenetic context; it allows the user to freely combine organisms and sequence features. To demonstrate the power of ANCAC and to make use of its larger pool of organisms and sequences, we briefly re-examine previously published relationships between sequence features and protein function, and between sequence signatures and biological context, without any claim to completeness.
Positively charged amino acids and ribosomal proteins
The integrity of the ribosome is based on the complex interaction between ribosomal proteins and ribosomal RNA. Here, electrostatic interactions between numerous arginine and lysine residues, particularly those in tail extensions, and the phosphate groups of the RNA backbone mediate many protein-RNA contacts. Ranking the COGs of all species contained in the COG database according to their relative abundance of positively charged amino acids (K, H, and R) almost exclusively yields top scores for COGs containing ribosomal proteins, demonstrating the power of ANCAC to link sequence composition with protein function (Figure 2).
Cognate bias hypothesis
Amino acid composition and growth temperature
Codon usage and growth temperature
GC content and aerobiosis
Naya et al. reported that aerobic prokaryotes display a significant increase in genomic GC content relative to anaerobic ones. Querying the COG database with ANCAC and selecting the pre-defined patterns “Aerobia” and “Anaerobia”, we readily confirm these findings, obtaining genomic GC contents of 53.94% and 44.78%, respectively.
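For readers who want to see what such a GC comparison involves computationally, the following Python sketch computes a pooled GC percentage for two groups of nucleotide sequences. The function name `gc_percent`, the decision to ignore ambiguous bases, and the toy sequences are assumptions for illustration; the sketch does not reproduce the values reported above, which come from the NUCOCOG data.

```python
def gc_percent(sequences):
    """Pooled GC percentage over an iterable of nucleotide strings."""
    gc = 0
    total = 0
    for seq in sequences:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        # Assumption: only unambiguous bases count towards the total,
        # so symbols such as N are ignored.
        total += sum(s.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("no unambiguous bases found")
    return 100.0 * gc / total


# Made-up sequences standing in for two organism groups:
group_one = ["GGCCGCAT", "GCGCAT"]
group_two = ["ATATGCAT", "AATTAT"]
difference = gc_percent(group_one) - gc_percent(group_two)
```

Pooling counts over all sequences of a group weights long sequences more heavily than averaging per-sequence percentages would; which of the two is preferable depends on the question being asked.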
The re-evaluations presented above illustrate possible applications of ANCAC. Further studies are feasible, e.g. concerning the abundance of oxygen-containing amino acids derived from aerobic versus anaerobic synthetic pathways, or the abundance of methionine and cysteine in sulphur anabolic pathways, among many others. Even bias in the amino acid composition of proteins differentially expressed during different metabolic states could be detected by ANCAC in batch processing mode. Such bias has already been reported for Saccharomyces cerevisiae during oxidative and reductive energy-yielding reactions. Although the COG database has become a standard for ‘uniform-function’ protein groups, it contains only 66 representative genomes. The organism coverage of the COG and archaeal COG databases currently implemented in ANCAC is therefore a limitation of the tool.
Many studies dealing with links between sequence features such as nucleotide, amino acid, and codon frequencies and functional aspects of proteins as well as biological or phylogenetic issues have been published so far. All of them have required intensive programming work since there has been no software-tool to directly and simply perform such computations. ANCAC, although currently limited to the data of the COG and arCOG databases, ultimately fills this gap.
Availability and requirements
Project name: Amino acid, nucleotide, and codon analysis of COGs (ANCAC)
Project homepage: http://www.uni-wh.de/ancac
Operating system(s): Platform independent
License: GNU General Public License
Any restrictions to use by non-academics: contact authors
ANCAC: Amino acid, nucleotide, and codon analysis of COGs
APSF: Average percentage of the sequence features selected
arCOG: Archaeal clusters of orthologous genes
arNUCOCOG: Archaeal nucleotide containing clusters of orthologous genes database
CGI: Common gateway interface
COG: Cluster of orthologous groups
CSS: Cascading style sheets
GUI: Graphical user interface
HTTP: Hypertext transfer protocol
NUCOCOG: Nucleotide containing COG database
OGT: Optimum growth temperature
PERL: Practical extraction and report language
SQL: Structured query language
XML: Extensible markup language.
We thank Daniela Kaufmann for her help in editing the manuscript, Fatemeh Gholamrezaei for her excellent Excel spreadsheets, and the staff at Bereich für Informationstechnologie (BIT) at Witten/Herdecke University for supporting us in implementing ANCAC on our web-servers.
- Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on protein families. Science 1997, 278(5338):631–637. 10.1126/science.278.5338.631
- Tatusov RL, Galperin MY, Natale DA, Koonin EV: The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res 2000, 28(1):33–36. 10.1093/nar/28.1.33
- Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 2001, 29(1):22–28. 10.1093/nar/29.1.22
- Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al.: The COG database: an updated version includes eukaryotes. BMC Bioinforma 2003, 4: 41. 10.1186/1471-2105-4-41
- Kaufmann M: The role of the COG database in comparative and functional genomics. Curr Bioinforma 2006, 1(3):291–300. 10.2174/157489306777828017
- Makarova KS, Sorokin AV, Novichkov PS, Wolf YI, Koonin EV: Clusters of orthologous genes for 41 archaeal genomes and implications for evolutionary genomics of archaea. Biol Direct 2007, 2: 33. 10.1186/1745-6150-2-33
- Meereis F, Kaufmann M: Extension of the COG and arCOG databases by amino acid and nucleotide sequences. BMC Bioinforma 2008, 9: 479. 10.1186/1471-2105-9-479
- Alves R, Savageau MA: Evidence of selection for low cognate amino acid bias in amino acid biosynthetic enzymes. Mol Microbiol 2005, 56(4):1017–1034. 10.1111/j.1365-2958.2005.04566.x
- Perlstein EO, de Bivort BL, Kunes S, Schreiber SL: Evolutionarily conserved optimization of amino acid biosynthesis. J Mol Evol 2007, 65(2):186–196. 10.1007/s00239-007-0013-x
- Federhen S: The NCBI Taxonomy database. Nucleic Acids Res 2012, 40(Database issue):D136-D143.
- Klein DJ, Moore PB, Steitz TA: The roles of ribosomal proteins in the structure assembly, and evolution of the large ribosomal subunit. J Mol Biol 2004, 340(1):141–177. 10.1016/j.jmb.2004.03.076
- Farias ST, Bonato MC: Preferred amino acids and thermostability. Genet Mol Res 2003, 2(4):383–393.
- Van der Linden MG, de Farias ST: Correlation between codon usage and thermostability. Extremophiles 2006, 10(5):479–481. 10.1007/s00792-006-0533-0
- Naya H, Romero H, Zavala A, Alvarez B, Musto H: Aerobiosis increases the genomic guanine plus cytosine content (GC%) in prokaryotes. J Mol Evol 2002, 55(3):260–264. 10.1007/s00239-002-2323-3
- de Bivort BL, Perlstein EO, Kunes S, Schreiber SL: Amino acid metabolic origin as an evolutionary influence on protein sequence in yeast. J Mol Evol 2009, 68(5):490–497. 10.1007/s00239-009-9218-5
- Alexeyenko A, Lindberg J, Pérez-Bercoff Å, Sonnhammer E: Overview and comparison of ortholog databases. Drug Discovery Today: Technologies 2006, 3(2):137–143. 10.1016/j.ddtec.2006.06.002
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.