Comparison of codon usage measures and their applicability in prediction of microbial gene expressivity
© Supek and Vlahoviček; licensee BioMed Central Ltd. 2005
Received: 11 February 2005
Accepted: 19 July 2005
Published: 19 July 2005
The Erratum to this article has been published in BMC Bioinformatics 2010 11:463
There are a number of methods (also called: measures) currently in use that quantify codon usage in genes. These measures are often influenced by other sequence properties, such as length. This can introduce strong methodological bias into measurements; therefore we attempted to develop a method free from such dependencies. One of the common applications of codon usage analyses is to quantitatively predict gene expressivity.
We compared the performance of several commonly used measures and a novel method we introduce in this paper – Measure Independent of Length and Composition (MILC). Large, randomly generated sequence sets were used to test for dependence on (i) sequence length, (ii) overall amount of codon bias and (iii) codon bias discrepancy in the sequences. A derivative of the method, named MELP (MILC-based Expression Level Predictor) can be used to quantitatively predict gene expression levels from genomic data. It was compared to other similar predictors by examining their correlation with actual, experimentally obtained mRNA or protein abundances.
We have established that MILC is a generally applicable measure, being resistant to changes in gene length and overall nucleotide composition, and introducing little noise into measurements. Other methods, however, may also be appropriate in certain applications. Our efforts to quantitatively predict gene expression levels in several prokaryotes and unicellular eukaryotes met with varying levels of success, depending on the experimental dataset and predictor used. Out of all methods, MELP and Rainer Merkl's GCB method had the most consistent behaviour. A 'reference set' containing known ribosomal protein genes appears to be a valid starting point for a codon usage-based expressivity prediction.
As the numbers of sequenced genes grew, it became evident that synonymous codons are not used equally [1–3]. Codon frequencies were found to vary on 3 levels: between genomes, between genes in the same genome, and within a single gene . Many factors have been shown to influence codon usage patterns, the most important being: (i) overall nucleotide composition of the genome, reflecting mutational biases; (ii) selective forces acting on highly expressed genes to improve efficiency of translation ; and (iii) horizontal gene transfer, with transferred genes retaining the codon frequencies of their former host . Connections have also been demonstrated between codon usage and: (i) gene length ; (ii) location on the chromosome ; (iii) the strand it resides on ; (iv) need for specific secondary structures in mRNA ; and (v) characteristics of the gene's protein product, such as its hydrophobicity  or secondary structure elements .
Moreover, the relative influence of each of these factors varies from genome to genome, and from gene to gene. For example, selection for translation efficiency shapes codon usage more in fast-growing microbes  than in slow-growing ones . In contrast, codon usage of human genes depends largely on GC richness of the chromosomal region (isochore) . It is still unclear to what extent other elements contribute to the genes' codon usage patterns . The multitude of influences on codon preferences, as well as high dimensionality of codon usage data, necessitated the development of various measures (also called: statistics) of codon usage.
Many researchers in this field formulated their own measures, which led to a large number of available methods [17, 18] for codon usage analysis. Unfortunately, these methods are not universally applicable, as their behaviour tends to be context-dependent. They may exhibit strong artefacts with varying (i) sequence length, (ii) overall amount of codon bias and (iii) codon bias discrepancy (see Results and Discussion for an explanation). Previous works [19, 20] discussed this issue and compared some of the commonly used measures available at the time. Our aim was to develop and test a measure that would be free from dependence on the aforementioned contexts. We also attempted to verify the usefulness of such a measure by employing it to predict gene expressivity in microbial genomes.
Results & discussion
The "Measure Independent of Length and Composition" (MILC)
Our primary motivation in developing this novel method was to correct for possible artefacts due to sequence length variability. The measure should be able to quantify the distance in codon usage between a gene and some expected distribution of codons. The codon distribution could either be calculated from the background nucleotide composition, or derived from a single gene or a gene group. Therefore, MILC is conceptually similar to Karlin and Mrazek's B , Novembre's ENC'  or Urrutia and Hurst's MCB method .
Mathematically, the measure is based on a log-likelihood ratio score used in the statistical G-test for goodness-of-fit. This methodology yields numerically similar results to the more commonly used χ2 test, but may hold theoretical advantages over it in statistical analyses . Both of the methods have been used in past examinations of codon usage patterns [24, 25].
The individual contribution M_a of each amino acid a to the MILC statistic is calculated as
$$M_a = 2 \sum_{c} O_c \ln \frac{O_c}{E_c} \qquad (1)$$
where O_c denotes the actual observed count of codon c in a gene, and E_c stands for the expected count of the same codon; the sum runs over the codons c that encode amino acid a. The O_c/E_c ratio is mathematically equal to, and can be replaced by, f_c/g_c, where f_c is the frequency of codon c in the gene and g_c is the expected frequency of the same codon. The sum of f or g over all codons for each amino acid should equal 1. The total difference in codon usage is then assessed by the following formula:
$$\mathrm{MILC} = \frac{\sum_{a} M_a}{L} - C \qquad (2)$$
The sum of contributions of all amino acids (stop codons are excluded from calculation) is divided by L, the gene length in codons, in an attempt to compensate for the expected increase with the total number of codons. This is analogous to the procedure described in . However, such a "scaled χ2" statistic still depends on gene length , greatly overestimating the overall amount of bias in shorter sequences. The correction factor C in Equation 2 attempts to correct for this overestimation.
The cause of the above-mentioned effect is sampling error: a relatively small number of observations (counted codons) cannot exactly fit the expected distribution, leading to a higher perceived χ2 score. To demonstrate the effect, let us presume that the expected codon frequencies for the two cysteine codons are g(UGU) = 0.5 and g(UGC) = 0.5, and that our hypothetical gene complies with these codon frequencies. However, a short gene might have only a single codon for Cys, so the observed counts can only be O_UGU = 1 and O_UGC = 0, or vice versa. Either way, instead of being equal to 0, the cysteine's contribution to the χ2 score will be:
$$\chi^2_{\mathrm{Cys}} = \frac{(1 - 0.5)^2}{0.5} + \frac{(0 - 0.5)^2}{0.5} = 1 \qquad (3)$$
If the gene has two cysteines, there is a 50% chance that O_UGU = O_UGC = 1, which would yield a (correct) χ2 score of 0, and a 50% chance that one of the counts will be 2 and the other 0, which gives a χ2 score of 2. The weighted average of these scores will again be equal to 1. Moving on to cases with 3, 4 or more cysteines, we see that M_Cys is always 1, and it can be shown that for each amino acid in this case M_a is equal to its degree of redundancy minus 1 (e.g. M_Ile = 2, M_Pro = 3). In fact, this is the expected value of the χ2 statistic under the null hypothesis (observed frequencies match the expected frequencies), which equals the number of degrees of freedom. The calculation can be generalized to cases when the observed frequencies do not match the expected codon distribution, and is also applicable to the G statistic MILC is based upon. Further examples to better illustrate this point are given in the material accompanying this paper [see Additional file 1].
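The averaging argument above can be checked by brute force. The following is a minimal sketch, not part of the original analysis: the function name `expected_chi2` is hypothetical, and the code simply enumerates every possible split of n codons among the synonymous codons of one amino acid and weights each χ2 score by its multinomial probability.

```python
from itertools import product
from math import factorial

def expected_chi2(n, probs):
    """Exact expected chi-squared score when n codons are drawn from `probs`.

    `probs` holds the expected frequencies g_c of the synonymous codons of a
    single amino acid (they must be positive and sum to 1).
    """
    r = len(probs)
    total = 0.0
    # Enumerate every way of distributing n observations among r codons.
    for counts in product(range(n + 1), repeat=r):
        if sum(counts) != n:
            continue
        # Multinomial probability of this particular vector of counts.
        coef = factorial(n)
        prob = 1.0
        for k, p in zip(counts, probs):
            coef //= factorial(k)
            prob *= p ** k
        # Chi-squared score of these counts against the expected counts n * p.
        chi2 = sum((k - n * p) ** 2 / (n * p) for k, p in zip(counts, probs))
        total += coef * prob * chi2
    return total

# Cysteine (two codons) with 1, 2 or 5 occurrences; isoleucine (three codons).
cys_scores = [expected_chi2(n, [0.5, 0.5]) for n in (1, 2, 5)]
ile_score = expected_chi2(4, [1 / 3, 1 / 3, 1 / 3])
```

Under these assumptions the cysteine scores all come out as 1 and the isoleucine score as 2 (up to floating-point rounding), matching the "degree of redundancy minus 1" rule stated in the text; the same holds for unequal expected frequencies.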
To reiterate, in a situation where the gene's codon usage matches the expected distribution, with all amino acids present, the sampling errors will increase the χ2 score by 41, and the "scaled χ2" by 41/L. The correction factor C is therefore calculated as:
where r_a is the number of possible codons for the amino acid a – its degeneracy class. Only the amino acids actually present at least once in the sequence contribute to C, e.g. if a gene missed one of the four-fold amino acids, C would be 38/L + 0.5. When the observed frequencies match the expected codon distribution closely, MILC can assume negative values. In order to compensate, a constant of 0.5 is added to the correction factor C (see Equation 4). Regarding minimum sequence length, we recommend that only sequences of 80 codons or longer be analysed using MILC (or any other measure of codon usage); many researchers set this threshold to even higher values, such as 100.
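The pieces described in this section can be put together in a short sketch. This is an illustration under stated assumptions, not the INCA implementation: the names `milc`, `DEGENERACY` and `UNIFORM` are hypothetical, the contribution of each amino acid is taken as the standard G statistic (2 Σ O ln(O/E)), and the 0.5 constant is applied so that it raises the final score, which is the direction the stated purpose (avoiding negative values) requires. Since the printed Equation 4 is not reproduced in this text, that sign convention should be treated as an assumption.

```python
from itertools import product
from math import log

# Standard genetic code, with bases ordered U, C, A, G at each codon position.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# r_a: number of synonymous codons per amino acid (stop codons excluded).
DEGENERACY = {}
for codon, aa in CODON_TO_AA.items():
    if aa != "*":
        DEGENERACY[aa] = DEGENERACY.get(aa, 0) + 1

# Expected frequencies g_c for equal use of synonymous codons.
UNIFORM = {c: 1.0 / DEGENERACY[aa] for c, aa in CODON_TO_AA.items() if aa != "*"}

def milc(codons, expected):
    """Sketch of MILC for a list of codon strings.

    `expected` maps each sense codon to g_c; the g_c of one amino acid must
    sum to 1 and be positive for every codon that occurs in the gene.
    Empty input is not handled.
    """
    observed, per_aa = {}, {}
    for c in codons:
        if c in expected:  # skips stop codons and unrecognised triplets
            observed[c] = observed.get(c, 0) + 1
            aa = CODON_TO_AA[c]
            per_aa[aa] = per_aa.get(aa, 0) + 1
    length = sum(observed.values())  # L, the gene length in codons
    # Sum of M_a over amino acids; E_c = g_c * (occurrences of that amino acid).
    total_m = sum(
        2.0 * o * log(o / (expected[c] * per_aa[CODON_TO_AA[c]]))
        for c, o in observed.items()
    )
    # Sum of (r_a - 1) over the amino acids actually present in the gene.
    dof = sum(DEGENERACY[aa] - 1 for aa in per_aa)
    # Length correction plus the constant offset (sign is an assumption).
    return total_m / length - dof / length + 0.5

# With all 20 amino acids present the degrees of freedom add up to 41.
TOTAL_DOF = sum(r - 1 for r in DEGENERACY.values())
```

For example, `milc(["UGU"] * 10, UNIFORM)` scores a gene fragment that uses only one of the two cysteine codons; a gene that matches the expected frequencies exactly receives a value below 0.5, and that value becomes negative if the constant is left out.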
Behaviour of codon usage measures under varying conditions
A multitude of methods to measure codon usage has been published, including "scaled χ2" , "effective number of codons" ENC , "codon bias index" CBI , "intrinsic codon bias index" ICDI , two versions of "codon bias" B [21, 29], "maximum likelihood codon bias" MCB , "effective number of codons prime" ENC' , and "synonymous codon bias orderliness" SCUO . Among those, we chose to test the methods that have been either frequently used in codon usage examinations, or that are new and haven't been extensively tested .
ENC is an older, widely accepted measure that quantifies the degree of deviation from equal use of synonymous codons; ENC' gives results comparable to ENC but allows comparison to any desired codon distribution; the 1998 version of Karlin and Mrazek's B has been used extensively in later research of microbial genomes by the same authors; MCB is a method conceptually similar to B, used in examinations of human genes; and SCUO is a representative of the information theory-based measures, which have recently been used on several occasions [31, 32] to analyze codon usage. Finally, the method proposed in this paper, MILC, is compared in performance to the aforementioned methodologies.
[Table 2. Determining the 'dynamic range' for measures of codon usage. Only the row labels survive: B, MCB, ENC' and MILC, two rows each, every row carrying the entry "None"; the numeric values were not recovered.]
We designed three experiments to determine to what extent changing gene length affects each measure. In the first experiment (Figures 1a and 1b) the expected distribution assumes equal codon frequencies ("None", see Methods) and the generated sets of genes attempt to mimic that distribution. Therefore, the methods should ideally report a minimal distance between the observed and the expected distribution. ENC, ENC', MILC and MCB are generally well behaved under these conditions and tend to overestimate the amount of bias somewhat in short sequences; MCB also overestimates bias in longer sequences. In contrast, B and SCUO greatly overestimate the bias in shorter genes (by "shorter" we mean the range of gene lengths most frequent in genomes, e.g. 100–500 codons). For example, using B on sequences 250 and 500 codons long would make the first sequence appear twice as different from the expected distribution as the second one. Moreover, the overestimation at 250 codons may amount to as much as a quarter of the dynamic range of B. As anticipated, the variability of all measures (Figure 1b) decreases with an increase in gene length. It must be noted that MCB measurements introduce significantly less noise than the rest of the methods, particularly in short genes.
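The kind of length dependence examined in this experiment can be reproduced in miniature. The sketch below is only an illustration and does not reproduce the sequence generator used for Figure 1: it assumes that drawing codons uniformly from the 61 sense codons is an acceptable stand-in for the "None" distribution, and it scores each random gene with an uncorrected, length-scaled G statistic (the names `scaled_g` and `mean_score` are hypothetical).

```python
import random
from itertools import product
from math import log

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE = [c for c, aa in CODON_TO_AA.items() if aa != "*"]
DEGENERACY = {}
for c in SENSE:
    DEGENERACY[CODON_TO_AA[c]] = DEGENERACY.get(CODON_TO_AA[c], 0) + 1

def scaled_g(gene):
    """Uncorrected G score divided by gene length, against equal codon usage."""
    observed, per_aa = {}, {}
    for c in gene:
        observed[c] = observed.get(c, 0) + 1
        per_aa[CODON_TO_AA[c]] = per_aa.get(CODON_TO_AA[c], 0) + 1
    # O_c / E_c with E_c = (occurrences of the amino acid) / r_a.
    g = sum(
        2.0 * o * log(o * DEGENERACY[CODON_TO_AA[c]] / per_aa[CODON_TO_AA[c]])
        for c, o in observed.items()
    )
    return g / len(gene)

def mean_score(length, n_genes=200, seed=0):
    """Mean score over random unbiased genes of the given length (in codons)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    scores = [scaled_g(rng.choices(SENSE, k=length)) for _ in range(n_genes)]
    return sum(scores) / n_genes

# Unbiased genes of increasing length: the apparent bias shrinks with length.
means_by_length = {length: mean_score(length) for length in (100, 200, 400)}
```

Even though every simulated gene is unbiased by construction, the shorter genes receive the higher mean score, which is the artefact a length correction is meant to remove.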
The second experiment, where the overall amount of bias in both the generated sequences and the expected distribution increases (Figure 1c) shows little change regarding length dependence – all methods see a very modest improvement in performance. ENC now tends to slightly underestimate bias, however, the variability chart (Figure 1d) shows that here it becomes noticeably less reliable than other methods, and so does SCUO. MCB is still the best performer, followed by MILC and B for shorter sequences, and ENC' for longer ones.
Figures 1e and 1f, representing the third experiment, demonstrate what happens when a gene unbiased in codon usage differs from the biased expected codon frequencies, derived from the "Med-1" dataset (see Methods). This is, in fact, a situation more likely to occur in real-life applications, as a gene would probably show at least some deviation from the expected codon distribution. As expected, ENC and SCUO behave precisely the same as in Figures 1a and 1b, because by definition they always assume an unbiased expected distribution. Interestingly, B improves significantly and is less influenced by gene length when the observed and expected codon distributions differ. It now performs on par with ENC' and MCB, both of which show a detrimental effect of increasing distance between the observed and the expected distribution. This factor also increases the amount of variation introduced by the measures (excluding ENC and SCUO), most of all ENC', and causes MCB to lose its advantage over MILC and B.
Measures of codon usage introduce different levels of statistical bias in shorter genes; however, it must be noted that even if this influence were completely eliminated, there might still exist a connection between codon bias and length caused by the inherent properties of the sequences. Selection might be acting to optimize codon usage patterns (and therefore translational efficiency) in energetically costly longer genes; on the other hand it might also act to reduce the size of highly expressed (and strongly biased) proteins . The only way to nullify these length effects – if this is desired – is to use regression, while employing a length-insensitive measure.
In addition to being resistant to length variation, the methods should ideally be invariant to both overall bias and the relative difference in codon usage. Moreover, the measures should be commutative with respect to properties of the observed and expected distributions. We designed two experiments to investigate these issues.
Furthermore, in order to test the commutative property, using each measure we compared datasets with varying levels of bias to the "None" expected distribution, and vice versa. Theoretically, when using many long sequences, comparing "None" genes to, for instance, "Med-1" expected distribution should yield the same result as comparing "Med-1" genes to the "None" expected distribution. In Figure 3b we show that among the measures that allow comparisons, the only one handling this appropriately was Karlin and Mrazek's B. MILC is less sensitive than ENC' and especially MCB, which displays a polar effect, being more strongly influenced by changes in the overall bias in the expected frequencies.
In genomes, individual amino acids may vary in amount of codon bias, an occurrence termed 'codon bias discrepancy', best described by the phrase "some codons are more optimal than others" in Fuglsang's paper . For instance, in E. coli the CGU and CGC codons for arginine are strongly preferred over the other four codons, while six codons for serine are chosen more uniformly, with a mild preference for AGC over the others.
Improving prediction of microbial gene expressivity
Analogous to Karlin and Mrazek's method of predicting expression levels of genes , we formulate a statistic named MELP (MILC-based Expression Level Predictor), computed simply as the ratio of the respective distances of a gene's codon usage from the genomic average and from a predefined reference set:
$$\mathrm{MELP} = \frac{\mathrm{MILC}_{\text{genomic average}}}{\mathrm{MILC}_{\text{reference set}}}$$
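A compact sketch of this ratio follows. It is an illustration rather than the INCA implementation: the helper names (`codon_freqs`, `milc`, `melp`) are hypothetical, the MILC form used is the G-statistic version with the 0.5 offset raising the score (an assumed sign convention), and the pseudocount in `codon_freqs` is an addition of this sketch so that a codon absent from a small reference set does not produce a zero expected frequency.

```python
from itertools import product
from math import log

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE = [c for c, aa in CODON_TO_AA.items() if aa != "*"]
DEGENERACY = {}
for c in SENSE:
    DEGENERACY[CODON_TO_AA[c]] = DEGENERACY.get(CODON_TO_AA[c], 0) + 1

def codon_freqs(genes, pseudocount=1.0):
    """Per-amino-acid codon frequencies pooled over a group of genes."""
    counts = {c: pseudocount for c in SENSE}  # pseudocount: sketch-only assumption
    for gene in genes:
        for c in gene:
            if c in counts:
                counts[c] += 1
    totals = {}
    for c, n in counts.items():
        aa = CODON_TO_AA[c]
        totals[aa] = totals.get(aa, 0.0) + n
    return {c: n / totals[CODON_TO_AA[c]] for c, n in counts.items()}

def milc(gene, expected):
    """MILC of one gene (list of codons) against expected codon frequencies."""
    observed, per_aa = {}, {}
    for c in gene:
        if c in expected:
            observed[c] = observed.get(c, 0) + 1
            per_aa[CODON_TO_AA[c]] = per_aa.get(CODON_TO_AA[c], 0) + 1
    length = sum(observed.values())
    total_m = sum(
        2.0 * o * log(o / (expected[c] * per_aa[CODON_TO_AA[c]]))
        for c, o in observed.items()
    )
    dof = sum(DEGENERACY[aa] - 1 for aa in per_aa)
    return total_m / length - dof / length + 0.5

def melp(gene, genome_genes, reference_genes):
    """Distance from the genomic average divided by distance from the reference set."""
    return milc(gene, codon_freqs(genome_genes)) / milc(gene, codon_freqs(reference_genes))
```

A gene whose codon usage resembles the reference set more than the genomic average gets a value above 1, and a gene that resembles the genomic average gets a value below 1. The sketch does not guard against a zero or negative denominator, which can occur for short genes that match the reference set very closely.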
Table 1. Transcript/protein abundance data used for validation of expression level predictors
Only the organism, "Files / accessions" and growth-condition cells of this table survive; they are listed here in their original order.
- 1.ref-abund.xls, column G
- 1.ref-abund.xls, column B
- Escherichia coli K-12 MG1655
- tables A1, A2, A3
- columns AB, RIC
- columns PHNppm, PSppm, NSppm
- minimal (MOPS, glucose)
- 3181Table6.xls, column D
- minimal (MOPS, glucose)
- Escherichia coli K-12 W3110
- ex298 – ex320, ex328 – ex334
- ex745 – ex749
- ex264, ex265, ex272, ex273, ex275, ex276, ex278 – ex286
- ex940 – ex945
- Synechocystis sp. PCC6803
- ex832 – ex839
- low light conditions
- ex22, 23, 24, 44
- Plasmodium falciparum 3D7
- average of 4 life stages
- Table_1, columns I, K, Q, AB, AD, AJ, AO, AQ
- average of 4 life stages
The agreement of predicted and actual protein/transcript levels varied greatly between all examined combinations of prediction method and dataset. The cause may lie in the quality of experimental data; for instance, mRNA abundances and protein 2D-PAGE data have been shown not to agree well in certain cases ; 2D-PAGE as a method may only be suitable for detection of abundant proteins , while microarray data tends to suffer from noise introduced at each step of different experimental protocols . The other probable reason for relatively incoherent results is that a model for predicting gene expression from genomic data, based solely on codon usage, is oversimplified. Other factors, such as promoter strength and gene copy number should also be taken into account. Fortunately, optimal codon usage in genes seems to coincide with factors enhancing transcription – this is why it is possible to observe a correlation between codon usage (acting at translation level) and transcript abundances. Keeping these limitations in mind, it seems safe to say that, in comparison to other predictors, GCB and MELP behave more consistently throughout all datasets.
Transcript and/or protein levels in a cell are normally subject to regulation, as opposed to codon usage patterns, which are 'hard-coded' in the genome sequence. If we suppose the major force shaping gene-specific codon usage patterns in microbes is selection for translation efficiency, which operates in periods of fast competitive growth, it follows that codon usage will be 'optimised' for genes highly expressed in such periods. For that reason we chose datasets of organisms harvested in exponential growth phase, and without severe nutritional restrictions in the medium. For instance, the Bsu-2 dataset describes Bacillus harvested at OD600 ≅ 0.4 – 0.6; an analogous dataset [see Additional file 1] for bacteria harvested at OD600 ≅ 1.1 does not correlate as well with predicted expression levels (Pearson's correlation coefficient for MELP = 0.234 vs. 0.187, for GCB = 0.277 vs. 0.185). In addition, the growth conditions should match the organism's natural habitat. For instance, E. coli grown in a rich medium has gene expression levels closer to the predicted values than E. coli in a defined medium; should the data in the Eco-2 dataset be replaced with data from MOPS+glucose grown cells [see Additional file 1], Pearson's correlation coefficient for log-transformed data drops from 0.720 to 0.663 (MELP), or from 0.708 to 0.642 (GCB). Furthermore, nitrogen or phosphorus starvation of E. coli in the Eco-3 dataset reduces the correlation with predicted values (data not shown). Such connections between codon usage and gene expression under different conditions can be used to hypothesize about the exact 'natural' environment of a microbe .
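The comparisons quoted above rest on Pearson's correlation coefficient computed on log-transformed data. A minimal sketch of such a calculation is given below; the function names are hypothetical, and it assumes that both the predictor values and the measured abundances are log-transformed and that non-positive pairs are dropped, neither of which is specified in the text.

```python
from math import log10, sqrt

def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def log_pearson(predicted, measured):
    """Correlation of log10(predicted) with log10(measured).

    Pairs with a non-positive value are skipped, because their logarithm is
    undefined (an assumption of this sketch, not a step taken from the paper).
    """
    pairs = [
        (log10(p), log10(m))
        for p, m in zip(predicted, measured)
        if p > 0 and m > 0
    ]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)
```

A call such as `log_pearson(melp_values, abundances)`, with both arguments as hypothetical per-gene lists in the same gene order, would return one coefficient per dataset and predictor.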
Any codon usage-based prediction of gene expression relies on a prior definition of a 'reference set', consisting of highly expressed genes. Our reference sets were defined as all genes coding for ribosomal proteins, longer than 100 codons; other approaches to this issue exist. For instance, the original definition for CAI  listed a set of genes which have been empirically proven to be highly expressed in yeast and E. coli; Karlin and Mrazek  included transcription/translation related factors and chaperones in the reference set, in addition to the ribosomal protein genes; attempts have been made to detect major trends in codon usage by iterative computational methods [38, 43] and use the results to define a reference set. We investigated to what extent reference set composition affects prediction of gene expression; the alternative reference sets used were obtained from Merkl  and generated by computationally detecting the major trend in codon usage in a genome. The sets normally contained ribosomal protein genes, elongation factors and energy metabolism genes; also photosynthesis genes in Synechocystis and histones in P. falciparum; such functional assignments for reference set genes were not unexpected. Under the assumption that the major trend is due to translational selection, the change in reference set composition should have theoretically resulted in improved prediction. However, the outcome was highly dependent on the genome examined, and the predictor used (shown as error bars in Figure 5). In some instances, the use of the alternative reference set resulted in poorer correlation. More high-quality transcript/protein abundance data would be required to reach a definite recommendation on forming a reference set.
We introduce MILC, a novel method of measuring codon usage bias in genes or gene groups, based on a corrected log-likelihood ratio (G) statistic. By comparing its performance to other commonly used measures of codon usage in a variety of contexts, we have established that MILC is a generally applicable method, being resistant to changes in gene length and overall nucleotide composition, and introducing little noise into measurements. Other measures, however, may also be appropriate for specific purposes: B, when comparing very long sequences (groups of genes, whole genomes) which are expected to differ significantly in codon usage and/or exhibit bias discrepancy; or MCB, when comparing sequences of varying lengths but relatively similar in codon preferences. We have also evaluated the methods' ability to estimate gene expression levels by comparing them to actual mRNA/protein abundance data from several species. Out of the tested predictors, GCB and MELP exhibit the most consistent behaviour. A reference set defined simply by including ribosomal protein genes appears to be a valid starting point for expression level predictions in the examined prokaryotes and unicellular eukaryotes, although one should be cautious when interpreting the results of such estimations. The MILC and MELP methods have been implemented in version 2 of the INCA software, available from the bioinfo-hr.org website .
Nucleotide composition of the generated sequences at silent sites
Values in Figures 1, 3 and 4 are expressed as percentages of the 'dynamic range' of a method, the largest difference between its high and low values under realistic conditions. This was assessed by comparing, using each method, first a set of 10000 'None' sequences (2500 codons long) to the 'None' frequency table, and then a set of 10000 'High-2' sequences (2500 codons long) to the 'None' frequencies, and finally by subtracting the numbers; this process is summarized in Table 2. Because of this normalization process, positive values of the mean always signify overestimation of bias, even though, for instance, a higher value of ENC' normally means less bias.
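The normalization described above can be sketched in a few lines. The function names are hypothetical, and the `ideal` argument (the value a measure would report in the absence of any artefact) is an assumption of this sketch, since the text does not spell out the baseline used for each figure.

```python
def dynamic_range(mean_none_vs_none, mean_high2_vs_none):
    """Signed dynamic range: high-bias reading minus unbiased reading.

    The result is negative for measures such as ENC', where a higher value
    means less bias.
    """
    return mean_high2_vs_none - mean_none_vs_none

def percent_of_range(value, ideal, rng):
    """Deviation from the ideal reading as a percentage of the dynamic range.

    Dividing by the signed range makes a positive result mean overestimated
    bias regardless of the direction in which the measure runs.
    """
    return 100.0 * (value - ideal) / rng

# Illustrative numbers only (not taken from Table 2): a measure that reads 61
# for unbiased sequences and 35 for strongly biased ones.
example_range = dynamic_range(61.0, 35.0)
example_percent = percent_of_range(55.0, 61.0, example_range)
```

With these illustrative numbers, a reading of 55 where 61 was expected comes out at roughly 23% of the dynamic range, reported as a positive value (overestimated bias) even though the raw reading is lower.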
The codon frequency tables used to generate sequences, derived from the None, Low, Med and High nucleotide compositions, are available in the accompanying materials [see Additional file 1], as well as the frequency tables used to test for codon usage discrepancy effects.
Predictors of gene expression
The expression level predictors CAI, E, and GCB were computed as in [37, 36] and , respectively. When calculating the 'frequency of optimal codons' Fop, a codon with a relative adaptiveness (codon frequency divided by the frequency of the most frequent codon) larger than 0.9 was considered optimal. Experimental datasets used to investigate the performance of the predictors are listed in Table 1. Datasets Sce-1, 2, 3, and Eco-4 were used 'as-is' from the respective sources. The Eco-1 dataset was created by combining molar abundances (column "N-abd") from Tables a1, a2 and a3 in ; if a gene occurred in more than one table, its final abundance value was calculated as an average of the two or three measurements. The Eco-2 dataset was created from the E. coli Gene-Protein Database  by multiplying values in the "AB" column (abundances) with values in the "RIC" column (rich media) and dividing by the "MWc" column to obtain molar abundances. The Eco-3 dataset was created by averaging the "PHNppm", "PSppm" and "NSppm" columns (control groups for phosphorus and nitrogen starvation experiments), and dividing by the "MWc" column. The Ecj-6, Bsu-1, 2, 3, Syn-1 and Syn-2 datasets were downloaded from the KEGG expression data repository  and were processed in the following manner: the local background ("Control-bkg") was subtracted from the signal intensity ("Control-sig") for each microarray spot in the control groups, and the resulting values were normalised to a sum of 10^6 per experiment. Finally, for each spot/gene a median value over all experiments in a dataset was calculated. The Pfa-1 dataset was created by averaging the sequence coverage of a protein over all four life stages; if a protein was not detected in a P. falciparum stage, its sequence coverage was assumed to equal 0.
To create the Pfa-2 dataset, the columns I, K, AB and AD were averaged to obtain an mRNA abundance for the trophozoite, Q and AJ for the merozoite; column AO provided values for the gametocyte, and column AQ for the sporozoite. The final abundance values were again obtained by averaging the four life stages. Files containing coding regions of genes were downloaded from the NCBI ftp site  for the Eco, Sce, Pfa and Syn datasets, and from the KEGG ftp site  for the Ecj and Bsu datasets.
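The processing of the KEGG microarray datasets described above (background subtraction, normalization of each experiment to a sum of 10^6, then a per-gene median over experiments) can be sketched as follows. The function names and the input layout (one dictionary per experiment, mapping a gene to a signal and background pair) are assumptions made for illustration; how negative background-corrected values were treated is not stated in the text, so they are left unchanged here.

```python
from statistics import median

def process_experiment(spots):
    """Background-correct one experiment and scale it to a total of 1e6.

    `spots` maps gene -> (signal, background), standing in for the
    "Control-sig" and "Control-bkg" columns.
    """
    corrected = {gene: sig - bkg for gene, (sig, bkg) in spots.items()}
    total = sum(corrected.values())
    return {gene: value * 1e6 / total for gene, value in corrected.items()}

def dataset_abundance(experiments):
    """Median normalised value per gene over all experiments in a dataset."""
    normalised = [process_experiment(spots) for spots in experiments]
    genes = set().union(*normalised)
    # A gene missing from an experiment is simply left out of its median.
    return {
        gene: median(exp[gene] for exp in normalised if gene in exp)
        for gene in genes
    }

# Two toy experiments with two genes each (made-up intensities).
toy = [
    {"geneA": (110.0, 10.0), "geneB": (310.0, 10.0)},
    {"geneA": (60.0, 10.0), "geneB": (60.0, 10.0)},
]
toy_abundance = dataset_abundance(toy)
```

Scaling every experiment to the same total before taking the median keeps an experiment with brighter overall signal from dominating the per-gene summary.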
FS thanks Rainer Merkl for helpful discussion and data used in manuscript preparation.
1. Ikemura T: Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J Mol Biol 1981, 151(3):389–409. doi:10.1016/0022-2836(81)90003-6
2. Grantham R, Gautier C, Gouy M, Jacobzone M, Mercier R: Codon catalog usage is a genome strategy modulated for gene expressivity. Nucleic Acids Res 1981, 9(1):r43–74. doi:10.1093/nar/9.1.213-b
3. Gouy M, Gautier C: Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res 1982, 10(22):7055–7074. doi:10.1093/nar/10.22.7055
4. Hooper SD, Berg OG: Gradients in nucleotide and codon usage along Escherichia coli genes. Nucleic Acids Res 2000, 28(18):3517–3523. doi:10.1093/nar/28.18.3517
5. Ikemura T: Codon usage and tRNA content in unicellular and multicellular organisms. Mol Biol Evol 1985, 2(1):13–34.
6. Lawrence JG, Ochman H: Molecular archaeology of the Escherichia coli genome. Proc Natl Acad Sci U S A 1998, 95(16):9413–9417. doi:10.1073/pnas.95.16.9413
7. Moriyama EN, Powell JR: Gene length and codon usage bias in Drosophila melanogaster, Saccharomyces cerevisiae and Escherichia coli. Nucleic Acids Res 1998, 26(13):3188–3193. doi:10.1093/nar/26.13.3188
8. Daubin V, Perriere G: G+C3 structuring along the genome: a common feature in prokaryotes. Mol Biol Evol 2003, 20(4):471–483. doi:10.1093/molbev/msg022
9. Lafay B, Lloyd AT, McLean MJ, Devine KM, Sharp PM, Wolfe KH: Proteome composition and codon usage in spirochaetes: species-specific and DNA strand-specific mutational biases. Nucleic Acids Res 1999, 27(7):1642–1649. doi:10.1093/nar/27.7.1642
10. Seffens W, Digby D: mRNAs have greater negative folding free energies than shuffled or codon choice randomized sequences. Nucleic Acids Res 1999, 27(7):1578–1584. doi:10.1093/nar/27.7.1578
11. D'Onofrio G, Jabbari K, Musto H, Bernardi G: The correlation of protein hydropathy with the base composition of coding sequences. Gene 1999, 238(1):3–14. doi:10.1016/S0378-1119(99)00257-7
12. Oresic M, Shalloway D: Specific correlations between relative synonymous codon usage and protein secondary structure. J Mol Biol 1998, 281(1):31–48. doi:10.1006/jmbi.1998.1921
13. Karlin S, Mrazek J, Campbell A, Kaiser D: Characterizations of highly expressed genes of four fast-growing bacteria. J Bacteriol 2001, 183(17):5025–5040. doi:10.1128/JB.183.17.5025-5040.2001
14. Lafay B, Atherton JC, Sharp PM: Absence of translationally selected synonymous codon usage bias in Helicobacter pylori. Microbiology 2000, 146(Pt 4):851–860.
15. Sharp PM, Averof M, Lloyd AT, Matassi G, Peden JF: DNA sequence evolution: the sounds of silence. Philos Trans R Soc Lond B Biol Sci 1995, 349(1329):241–247. doi:10.1098/rstb.1995.0108
16. Urrutia AO, Hurst LD: The signature of selection mediated by expression on human genes. Genome Res 2003, 13(10):2260–2264. doi:10.1101/gr.641103
17. Moriyama EN: Encyclopedia of the Human Genome: Codon Usage. [http://www.ehgonline.net]
18. Ermolaeva MD: Synonymous codon usage in bacteria. Curr Issues Mol Biol 2001, 3(4):91–97.
19. Novembre JA: Accounting for background nucleotide composition when measuring codon usage bias. Mol Biol Evol 2002, 19(8):1390–1394.
20. Comeron JM, Aguade M: An evaluation of measures of synonymous codon usage bias. J Mol Evol 1998, 47(3):268–274. doi:10.1007/PL00006384
21. Karlin S, Mrazek J, Campbell AM: Codon usages in different gene classes of the Escherichia coli genome. Mol Microbiol 1998, 29(6):1341–1355. doi:10.1046/j.1365-2958.1998.01008.x
22. Urrutia AO, Hurst LD: Codon usage bias covaries with expression breadth and the rate of synonymous evolution in humans, but this is not evidence for selection. Genetics 2001, 159(3):1191–1199.
23. Rohlf FJ, Sokal RR: Biometry. W. H. Freeman; 1994.
24. Sharp PM, Tuohy TM, Mosurski KR: Codon usage in yeast: cluster analysis clearly differentiates highly and lowly expressed genes. Nucleic Acids Res 1986, 14(13):5125–5143. doi:10.1093/nar/14.13.5125
25. Shields DC, Sharp PM: Synonymous codon usage in Bacillus subtilis reflects both translational selection and mutational biases. Nucleic Acids Res 1987, 15(19):8023–8040. doi:10.1093/nar/15.19.8023
26. Wright F: The 'effective number of codons' used in a gene. Gene 1990, 87(1):23–29. doi:10.1016/0378-1119(90)90491-9
27. Morton BR: Codon use and the rate of divergence of land plant chloroplast genes. Mol Biol Evol 1994, 11(2):231–238.
28. Freire-Picos MA, Gonzalez-Siso MI, Rodriguez-Belmonte E, Rodriguez-Torres AM, Ramil E, Cerdan ME: Codon usage in Kluyveromyces lactis and in yeast cytochrome c-encoding genes. Gene 1994, 139(1):43–49. doi:10.1016/0378-1119(94)90521-5
29. Karlin S, Mrazek J: What drives codon choices in human genes? J Mol Biol 1996, 262(4):459–472. doi:10.1006/jmbi.1996.0528
30. Wan XF, Xu D, Kleinhofs A, Zhou J: Quantitative relationship between synonymous codon usage bias and GC composition across unicellular genomes. BMC Evol Biol 2004, 4(1):19. doi:10.1186/1471-2148-4-19
31. Wang HC, Badger J, Kearney P, Li M: Analysis of codon usage patterns of bacterial genomes using the self-organizing map. Mol Biol Evol 2001, 18(5):792–800.
32. Zeeberg B: Shannon information theoretic computation of synonymous codon usage biases in coding regions of human and mouse genomes. Genome Res 2002, 12(6):944–955. doi:10.1101/gr.213402
33. Supek F, Vlahovicek K: INCA: synonymous codon usage analysis and clustering by means of self-organizing map. Bioinformatics 2004, 20(14):2329–2330. doi:10.1093/bioinformatics/bth238
34. Fuglsang A: The effective number of codons for individual amino acids: some codons are more optimal than others. Gene 2003, 320:185–190. doi:10.1016/S0378-1119(03)00829-1
35. Fuglsang A: The 'effective number of codons' revisited. Biochem Biophys Res Commun 2004, 317(3):957–964. doi:10.1016/j.bbrc.2004.03.138
36. Karlin S, Mrazek J: Predicted highly expressed genes of diverse prokaryotic genomes. J Bacteriol 2000, 182(18):5238–5250. doi:10.1128/JB.182.18.5238-5250.2000
37. Sharp PM, Li WH: The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res 1987, 15(3):1281–1295. doi:10.1093/nar/15.3.1281
38. Merkl R: A survey of codon and amino acid frequency bias in microbial genomes focusing on translational efficiency. J Mol Evol 2003, 57(4):453–466. doi:10.1007/s00239-003-2499-1
39. Gygi SP, Rochon Y, Franza BR, Aebersold R: Correlation between protein and mRNA abundance in yeast. Mol Cell Biol 1999, 19(3):1720–1730.
40. Gygi SP, Corthals GL, Zhang Y, Rochon Y, Aebersold R: Evaluation of two-dimensional gel electrophoresis-based proteome analysis technology. Proc Natl Acad Sci U S A 2000, 97(17):9390–9395. doi:10.1073/pnas.160270797
41. Schuchhardt J, Beule D, Malik A, Wolski E, Eickhoff H, Lehrach H, Herzel H: Normalization strategies for cDNA microarrays. Nucleic Acids Res 2000, 28(10):E47. doi:10.1093/nar/28.10.e47
42. Wagner A: Inferring lifestyle from gene expression patterns. Mol Biol Evol 2000, 17(12):1985–1987.
- Jansen R, Bussemaker HJ, Gerstein M: Revisiting the codon adaptation index from a whole-genome perspective: analyzing the relationship between gene expression and codon occurrence in yeast using a variety of models. Nucleic Acids Res 2003, 31(8):2242–2251. 10.1093/nar/gkg306PubMed CentralView ArticlePubMedGoogle Scholar
- Merkl R: Personal communication. 2004.Google Scholar
- Bioinfo-hr.org website[http://www.bioinfo-hr.org/inca]
- Link AJ, Robison K, Church GM: Comparing the predicted and observed properties of proteins encoded in the genome of Escherichia coli K-12. Electrophoresis 1997, 18(8):1259–1313. 10.1002/elps.1150180807View ArticlePubMedGoogle Scholar
- VanBogelen RA, Abshire KZ, Moldover B, Olson ER, Neidhardt FC: Escherichia coli proteome analysis using the gene-protein database. Electrophoresis 1997, 18(8):1243–1251. 10.1002/elps.1150180805View ArticlePubMedGoogle Scholar
- Nakao M, Bono H, Kawashima S, Kamiya T, Sato K, Goto S, Kanehisa M: Genome-scale Gene Expression Analysis and Pathway Reconstruction in KEGG. Genome Inform Ser Workshop Genome Inform 1999, 10: 94–103.PubMedGoogle Scholar
- NCBI Genomes FTP site[ftp://ftp.ncbi.nlm.nih.gov/genomes/]
- KEGG Genomes FTP site[ftp://ftp.genome.jp/kegg/genomes/genes]
- Greenbaum D, Colangelo C, Williams K, Gerstein M: Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biol 2003, 4(9):117. 10.1186/gb-2003-4-9-117PubMed CentralView ArticlePubMedGoogle Scholar
- Ghaemmaghami S, Huh WK, Bower K, Howson RW, Belle A, Dephoure N, O'Shea EK, Weissman JS: Global analysis of protein expression in yeast. Nature 2003, 425(6959):737–741. 10.1038/nature02046View ArticlePubMedGoogle Scholar
- Bernstein JA, Khodursky AB, Lin PH, Lin-Chao S, Cohen SN: Global analysis of mRNA decay and abundance in Escherichia coli at single-gene resolution using two-color fluorescent DNA microarrays. Proc Natl Acad Sci U S A 2002, 99(15):9697–9702. 10.1073/pnas.112318199PubMed CentralView ArticlePubMedGoogle Scholar
- Allen TE, Herrgard MJ, Liu M, Qiu Y, Glasner JD, Blattner FR, Palsson BO: Genome-scale analysis of the uses of the Escherichia coli genome: model-driven analysis of heterogeneous data sets. J Bacteriol 2003, 185(21):6392–6399. 10.1128/JB.185.21.6392-6399.2003PubMed CentralView ArticlePubMedGoogle Scholar
- Mori H, Horiuchi T, Isono K, Wada C, Kanaya S, Kitagawa M, Ara T, Ohshima H: [Post sequence genome analysis of Escherichia coli]. Tanpakushitsu Kakusan Koso 2001, 46(13):1977–1985.PubMedGoogle Scholar
- Asai K, Yamaguchi H, Kang CM, Yoshida K, Fujita Y, Sadaie Y: DNA microarray analysis of Bacillus subtilis sigma factors of extracytoplasmic function family. FEMS Microbiol Lett 2003, 220(1):155–160. 10.1016/S0378-1097(03)00093-4View ArticlePubMedGoogle Scholar
- Kobayashi K, Ogura M, Yamaguchi H, Yoshida K, Ogasawara N, Tanaka T, Fujita Y: Comprehensive DNA microarray analysis of Bacillus subtilis two-component regulatory systems. J Bacteriol 2001, 183(24):7365–7370. 10.1128/JB.183.24.7365-7370.2001PubMed CentralView ArticlePubMedGoogle Scholar
- Serizawa M, Yamamoto H, Yamaguchi H, Fujita Y, Kobayashi K, Ogasawara N, Sekiguchi J: Systematic analysis of SigD-regulated genes in Bacillus subtilis by DNA microarray and Northern blotting analyses. Gene 2004, 329: 125–136. 10.1016/j.gene.2003.12.024View ArticlePubMedGoogle Scholar
- Hihara Y, Sonoike K, Kanehisa M, Ikeuchi M: DNA microarray analysis of redox-responsive genes in the genome of the cyanobacterium Synechocystis sp. strain PCC 6803. J Bacteriol 2003, 185(5):1719–1725. 10.1128/JB.185.5.1719-1725.2003PubMed CentralView ArticlePubMedGoogle Scholar
- Yoshimura H, Yanagisawa S, Kanehisa M, Ohmori M: Screening for the target gene of cyanobacterial cAMP receptor protein SYCRP1. Mol Microbiol 2002, 43(4):843–853. 10.1046/j.1365-2958.2002.02790.xView ArticlePubMedGoogle Scholar
- Florens L, Washburn MP, Raine JD, Anthony RM, Grainger M, Haynes JD, Moch JK, Muster N, Sacci JB, Tabb DL, Witney AA, Wolters D, Wu Y, Gardner MJ, Holder AA, Sinden RE, Yates JR, Carucci DJ: A proteomic view of the Plasmodium falciparum life cycle. Nature 2002, 419(6906):520–526. 10.1038/nature01107View ArticlePubMedGoogle Scholar
- Le Roch KG, Zhou Y, Blair PL, Grainger M, Moch JK, Haynes JD, De La Vega P, Holder AA, Batalov S, Carucci DJ, Winzeler EA: Discovery of gene function by expression profiling of the malaria parasite life cycle. Science 2003, 301(5639):1503–1508. 10.1126/science.1087025View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.