E-CAI: a novel server to estimate an expected value of Codon Adaptation Index (eCAI)
© Puigbò et al; licensee BioMed Central Ltd. 2008
Received: 30 March 2007
Accepted: 29 January 2008
Published: 29 January 2008
The Codon Adaptation Index (CAI) is a measure of the synonymous codon usage bias for a DNA or RNA sequence. It quantifies the similarity between the synonymous codon usage of a gene and the synonymous codon frequency of a reference set. Extreme values in the nucleotide or in the amino acid composition have a large impact on differential preference for synonymous codons. It is thence essential to define the limits for the expected value of CAI on the basis of sequence composition in order to properly interpret the CAI and provide statistical support to CAI analyses. Though several freely available programs calculate the CAI for a given DNA sequence, none of them corrects for compositional biases or provides confidence intervals for CAI values.
The E-CAI server, available at http://genomes.urv.es/CAIcal/E-CAI, is a web-application that calculates an expected value of CAI for a set of query sequences by generating random sequences with G+C and amino acid content similar to those of the input. An executable file, a tutorial, a Frequently Asked Questions (FAQ) section and several examples are also available. To exemplify the use of the E-CAI server, we have analysed the codon adaptation of human mitochondrial genes that codify a subunit of the mitochondrial respiratory chain (excluding those genes that lack a prokaryotic orthologue) and are encoded in the nuclear genome. It is assumed that these genes were transferred from the proto-mitochondrial to the nuclear genome and that its codon usage was then ameliorated.
The E-CAI server provides a direct threshold value for discerning whether the differences in CAI are statistically significant or whether they are merely artifacts that arise from internal biases in the G+C composition and/or amino acid composition of the query sequences.
The Codon Adaptation Index (CAI), introduced by Sharp and Li , is a measure of the synonymous codon usage bias for a DNA or RNA sequence and measures the resemblance between the synonymous codon usage of a gene and the synonymous codon frequencies of a reference set. The CAI index ranges from zero to one being one if a gene always uses, for each encoded amino acid, the most frequently used synonymous codon in the reference set. Though it was originally developed to assess how effective selection has been at moulding the pattern of codon usage , it has since been applied to problems such as predicting the expression level of a gene , predicting a group of highly expressed genes [3, 4], assessing the adaptation of viral genes to their hosts , giving an approximate indication of the likely success of heterologous gene expression , making comparisons of codon usage preferences in different organisms , identifying horizontally transferred genes [6–8], detecting dominating synonymous genomic codon usage bias in genomes , acquiring new knowledge about species lifestyle , and identifying the causes of protein rate variation [11, 12].
Since the absolute value of the CAI depends on the query sequence and on the reference set, both of these parameters are important for correctly interpreting CAI values. On the one hand, if the reference set has a random synonymous codon usage with few differences in the use of synonymous codons, the CAI values will be high, i.e. close to one. On the other hand, extreme G+C and/or amino acid compositions on the query sequence may lead to extreme CAI values that are not directly linked to codon usage preferences. It is therefore essential to define a threshold level for the expected CAI value (eCAI) in order to interpret the significance of codon usage biases and to provide statistical support to CAI analyses. The eCAI estimated by our server makes it possible to discern whether differences in the CAI are statistically significant or whether they cannot be distinguished from biases due to nucleotide or amino acid composition. Although several authors have used some kind of expected codon usage [13, 14], there is no server or program available to estimate it.
The E-CAI server uses a novel algorithm that calculates an expected CAI for a set of query sequences by generating random sequences with similar G+C content and amino acid composition to the query sequences. The server, implemented in PHP, is integrated with several tools for the calculation and graphical representation of CAI. CAI value is calculated as Sharp and Li originally defined it  but using the recent computer implementation proposed by Xia . The Perl source code and a graphical interface written in Tcl/Tk, as well as a tutorial, a Frequently Asked Questions (FAQ) section and several examples are available on the server homepage.
Inputs of the server
The basic inputs for calculating the expected CAI value are the query sequences, the codon usage of the reference set and the genetic code used. The query sequences must be DNA or RNA sequences in fasta format. The codon usage of the reference set can be introduced in a variety of formats, including the format of the Codon Usage Database . Optionally, the user can introduce a G+C percentage to generate the random sequences. If this G+C percentage is not introduced, the server uses the G+C percentage from the query sequences.
Generation of the random sequences and estimation of the expected CAI
The method for estimating an expected CAI is based on generating 500 random sequences with the same amino acid composition as the query but with codon usage assigned randomly, either on the basis of the average G+C content of the input, or on the basis of the G+C percentage introduced by the user. Once all random sequences are generated, their CAI values are calculated. The normality of the CAI values of the random generated sequences is assessed with a Kolmogorov-Smirnov Test. An expected CAI value is then estimated using an upper one-sided tolerance interval for a normal distribution and a confidence limit and a percentage of the population (also called coverage) chosen by the user . A tolerance interval is a way to determine a range within which, with some confidence, a specified proportion of a population falls. The eCAI therefore represents the upper limit of the CAI for sequences with a codon usage caused solely by mutational bias. This means that if the CAI value of a gene is bigger than the expected value estimated on composition bias alone, it may be considered evidence of codon usage adaptation or selection. An effective and intuitive way to compare the CAI value of a gene with its expected CAI value is to use that we call the normalised CAI value. This normalised CAI is defined as the quotient between the CAI of a gene and its expected value eCAI.
The E-CAI server allows two methods for generating the random sequences. The first one, called Markov, is a Markov Model of order 0. This means that the probability of finding an amino acid at a specific position is independent of the other amino acid positions. The Markov method generates the random sequences by adding one amino acid each time, using the frequencies of each amino acid in the query sequences and a random number. It chooses a random number in the interval (0,1), sums the fractions of the amino acid composition of the query and assigns as the next amino acid the one that causes the sum to exceed the random number . This process is repeated until the desired length of the sequence is reached. The random sequences are then back-translated to DNA sequences, assigning randomly one of the synonymous codon to each amino acid, either on the basis of the average G+C content of the input or on the basis of the G+C percentage introduced by the user. The second method for generating the random sequences, called Poisson, is based on the assumption that the number of occurrences for each amino acid in a sequence follows a Poisson distribution. The normalised amino acid frequencies in the query sequences multiplied by the length (n) of the generated random sequences are used as the expected numbers of occurrences of each amino acid in the random sequences. These values are used to calculate the probabilities that there were exactly k occurrences of each amino acid in a sequence of length n. From the sum of these probabilities and a random number, the expected number of occurrences for each amino acid in a random sequence is calculated in a similar way to the Markov method. This process is repeated until the desired number of sequences has been generated. Again, the random sequences are then back-translated to DNA sequences by the same method described above. The results generated by the Markov and Poisson methods are comparable, but the Markov method is more precise and the Poisson method is faster. In addition, similar values of eCAI are obtained when the GenRGenS software is used to generate the random sequences .
Interpretation of the results
The reference set used to calculate the CAI is important for the correct interpretation of its meaning. The CAI measures the similarity between the synonymous codon usage of a gene and the synonymous codon frequency of a reference set. If this reference set is a group of highly expressed genes and in the presence of selected codon usage bias, the CAI values can be used to predict the expression level of genes . However, there is an intrinsic weakness in the interpretation of CAI values when used for species with a highly biased base composition . A further problem also may arise when CAI is used in species which do not display a dominant translational bias [9, 20]. Therefore, it is necessary to establish whether highly expressed genes have translationally selected biased codon usage . In this respect, the algorithm E-CAI can successfully overcome the effects of compositional biases when calculating CAI values. If the average codon usage of a genome is used as a reference set, the CAI can be interpreted as a measure of the codon adaptation of a gene in the context of a genome. This information can be used to optimise the expression of a gene in a heterologous expression system . The values of eCAI calculated by the E-CAI server are expected to be over-estimations because the synonymous codon usage of genes is highly influenced by the G+C content at the third codon position and because amino acid usage is also species-specific . The query sequences define both nucleotide and amino acid composition and are therefore important factors in the calculation of eCAI. The expected CAI value could be meaningless if the composition of the query sequences are very heterogeneous. To assess the homogeneity of the sequences in the query set, a Chi-Square test is calculated to test the goodness-of-fit between the amino acid composition or G+C content of each of the query sequences and the average values used to generate the random sequences. The percentage of query sequences that fit the amino acid and/or G+C mean distributions are then shown. If the query sequences are compositionally very heterogeneous, these percentages will be small. In this case we suggest splitting the query sequences into smaller and homogeneous subsets and estimating the eCAI values for each of the subsets separately.
To calculate CAI values for hundreds or thousands of sequences on a whole-genome scale and generate an eCAI, users can download an executable program that automatically performs these calculations. The inputs, methods and outputs of this executable version are the same as those of the web version. However, it enables to choose the length and number of randomly generated sequences. More details about this script and how to use it are found in the tutorial.
Example: The Amelioration of mitochondrial genes encoded in the human nuclear genome
It is widely accepted that mitochondria have their origin in a single event, arising from a bacterial symbiont whose closest contemporary relatives are found within the alfa-proteobacteria [23, 24]. Since its origin, the mitochondrial genome has undergone a streamlining process of genome reduction with intense periods of loss of genes . Nowadays, mitochondrial genomes exhibit a great variation in protein gene content among most major groups of eukaryotes, but only limited variation within large and ancient groups. This suggests a very episodic, punctuated pattern of mitochondrial gene loss over the broad sweep of eukaryotic evolution . Mitochondrial genomes have lost genes that lack a selective pressure for their conservation. This could include genes whose function may no longer be necessary, genes whose function has been superseded by some pre-existing nuclear genes or genes that were originally present in the proto-mitochondria and that have been transferred to the nucleus . The gene content of present mitochondrial genomes varies from 63 protein-coding genes in Reclinomonas americana, a flagellate protozoon, to three genes in other species (see the GOBASE database , which contains information for more than 1500 complete mitochondrial genomes). Mitochondria in vertebrates encode for 13 respiratory-chain proteins and for a minimal set of tRNAs that suffices to translate all codons. However, the vast majority of proteins located in the mitochondria are the product of nuclear genes. These genes are encoded and transcribed in the nucleus, translated in the cytoplasm and the proteins are subsequently vehiculated to the mitochondria. Some of these proteins are orthologous of present prokaryote genes and are thought to be the result of horizontal gene transfer events from the proto-mitochondrial to the nuclear genome. This hypothesis is reinforced by the fact that several of these genes are encoded in the mitochondrial genome in other eukaryotic species .
Analysis of human mitochondrial genes that encode a subunit of complexes I-V of the mitochondrial respiratory chain encoded in the nuclear (a) or mitochondrial (b) genome.
a) Nuclear encoded genes
p = 0.05
p = 0.05
p = 0.05
p = 0.05
b) Mitochondrial encoded genes
CAI hm /eCAI hm
CAI mt /eCAI mt
p = 0.05
p = 0.05
p = 0.05
p = 0.05
The E-CAI server described here provides an expected value of CAI for discerning whether the differences in CAI are statistically significant and arise from the codon preferences or whether they are merely artifacts that arise from internal biases in the G+C composition and/or amino acid composition of the query sequences. Using a normalised CAI value, defined as the quotient between the CAI of a gene and its expected value, is an effective and intuitive way to analyze the codon usage bias of genes and codon usage adaptation.
Availability and requirements
Project name: E-CAI
Project home page: http://genomes.urv.es/CAIcal/E-CAI
Operating system(s): Platform independent
Programming language: PHP
Other requirements: none
Any restrictions to use by non-academics: license needed
This work has been financed by projects BIO2003-07672 and AGL2007-65678/ALI of the Spanish Ministry of Science and Technology. We thank John Bates and Kevin Costello of the Language Service of the Rovira i Virgili University for their help with writing the manuscript and Hervé Philippe for his comments on the manuscript. IGB is funded by the Volkswagen Foundation under the initiative "Evolutionary Biology".
- Sharp PM, Li WH: The codon Adaptation Index – a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res 1987, 15: 1281–1295. 10.1093/nar/15.3.1281PubMed CentralView ArticlePubMedGoogle Scholar
- Goetz RM, Fuglsang A: Correlation of codon bias measures with mRNA levels: analysis of transcriptome data from Escherichia coli. Biochem Biophys Res Commun 2005, 327: 4–7. 10.1016/j.bbrc.2004.11.134View ArticlePubMedGoogle Scholar
- Wu G, Culley DE, Zhang W: Predicted highly expressed genes in the genomes of Streptomyces coelicolor and Streptomyces avermitilis and the implications for their metabolism. Microbiology 2005, 151: 2175–2187. 10.1099/mic.0.27833-0View ArticlePubMedGoogle Scholar
- Wu G, Nie L, Zhang W: Predicted highly expressed genes in Nocardia farcinica and the implication for its primary metabolism and nocardial virulence. Antonie Van Leeuwenhoek 2006, 89: 135–146. 10.1007/s10482-005-9016-zView ArticlePubMedGoogle Scholar
- Puigbo P, Guzman E, Romeu A, Garcia-Vallve S: OPTIMIZER: A web server for optimizing the codon usage of DNA sequences. Nucleic Acids Res 2007, 35: W126-W131. 10.1093/nar/gkm219PubMed CentralView ArticlePubMedGoogle Scholar
- Lawrence JG, Ochman H: Molecular archaeology of the Escherichia coli genome. Proc Natl Acad Sci USA 1998, 95: 9413–9417. 10.1073/pnas.95.16.9413PubMed CentralView ArticlePubMedGoogle Scholar
- Garcia-Vallve S, Palau J, Romeu A: Horizontal gene transfer in glycosyl hydrolases inferred from codon usage in Escherichia coli and Bacillus subtilis. Mol Biol Evol 1999, 16: 1125–1134.View ArticlePubMedGoogle Scholar
- Garcia-Vallve S, Guzman E, Montero MA, Romeu A: HGT-DB: a database of putative horizontally transferred genes in prokaryotic complete genomes. Nucleic Acids Res 2003, 31: 187–189. 10.1093/nar/gkg004PubMed CentralView ArticlePubMedGoogle Scholar
- Carbone A, Zinovyev A, Kepes F: Codon adaptation index as a measure of dominating codon bias. Bioinformatics 2003, 19: 2005–2015. 10.1093/bioinformatics/btg272View ArticlePubMedGoogle Scholar
- Willenbrock H, Friis C, Juncker AS, Ussery DW: An environmental signature for 323 microbial genomes based on codon adaptation indices. Genome Biol 2006, 7: R114. 10.1186/gb-2006-7-12-r114PubMed CentralView ArticlePubMedGoogle Scholar
- Drummond DA, Raval A, Wilke CO: A single determinant dominates the rate of yeast protein evolution. Mol Biol Evol 2006, 23: 327–337. 10.1093/molbev/msj038View ArticlePubMedGoogle Scholar
- McInerney JO: The causes of protein evolutionary rate variation. Trends Ecol Evol 2006, 21: 230–232. 10.1016/j.tree.2006.03.008View ArticlePubMedGoogle Scholar
- Morton BR: Selection on the codon bias of chloroplast and cyanelle genes in different plant and algal lineages. J Mol Evol 1998, 46: 449–459. 10.1007/PL00006325View ArticlePubMedGoogle Scholar
- Supek F, Vlahovicek K: Comparison of codon usage measures and their applicability in prediction of microbial gene expressivity. BMC Bioinformatics 2005, 6: 182. 10.1186/1471-2105-6-182PubMed CentralView ArticlePubMedGoogle Scholar
- Xia X: An improved implementation of Codon Adaptation Index. Evolutionary Bioinformatics 2007, 3: 53–58.Google Scholar
- Nakamura Y, Gojobori T, Ikemura T: Codon usage tabulated from international DNA sequence databases: status for the year 2000. Nucleic Acids Res 2000, 28: 292. 10.1093/nar/28.1.292PubMed CentralView ArticlePubMedGoogle Scholar
- Hahn GJ, Meeker WQ: Statistical intervals: a guide for practitioners. New York: Wiley; 1991.View ArticleGoogle Scholar
- Fitch WM: Random sequences. J Mol Biol 1983, 163: 171–176. 10.1016/0022-2836(83)90002-5View ArticlePubMedGoogle Scholar
- Ponty Y, Termier M, Denise A: GenRGenS: software for generating random genomic sequences and structures. Bioinformatics 2006, 22: 1534–1535. 10.1093/bioinformatics/btl113View ArticlePubMedGoogle Scholar
- Henry I, Sharp PM: Predicting gene expression level from codon usage bias. Mol Biol Evol 2007, 24: 10–12. 10.1093/molbev/msl148View ArticlePubMedGoogle Scholar
- Grocock RJ, Sharp PM: Synonymous codon usage in Pseudomonas aeruginosa PA01. Gene 2002, 289: 131–139. 10.1016/S0378-1119(02)00503-6View ArticlePubMedGoogle Scholar
- Pasamontes A, Garcia-Vallve S: Use of a multi-way method to analyze the amino acid composition of a conserved group of orthologous proteins in prokaryotes. BMC Bioinformatics 2006, 7: 257. 10.1186/1471-2105-7-257PubMed CentralView ArticlePubMedGoogle Scholar
- Burger G, Gray MW, Lang BF: Mitochondrial genomes: anything goes. Trends Genet 2003, 19: 709–716. 10.1016/j.tig.2003.10.012View ArticlePubMedGoogle Scholar
- Gray MW, Burger G, Lang BF: Mitochondrial evolution. Science 1999, 283: 1476–1481. 10.1126/science.283.5407.1476View ArticlePubMedGoogle Scholar
- Gray MW, Burger G, Lang BF: The origin and early evolution of mitochondria. Genome Biol 2001, 2: REVIEWS1018.. 10.1186/gb-2001-2-6-reviews1018PubMed CentralView ArticlePubMedGoogle Scholar
- Adams KL, Palmer JD: Evolution of mitochondrial gene content: gene loss and transfer to the nucleus. Mol Phylogenet Evol 2003, 29: 380–395. 10.1016/S1055-7903(03)00194-5View ArticlePubMedGoogle Scholar
- O'Brien EA, Zhang Y, Yang L, Wang E, Marie V, Lang BF, Burger G: GOBASE – a database of organelle and bacterial genome information. Nucleic Acids Res 2006, 34: D697–699. 10.1093/nar/gkj098PubMed CentralView ArticlePubMedGoogle Scholar
- Gabaldon T, Huynen MA: Shaping the mitochondrial proteome. Biochim Biophys Acta 2004, 1659: 212–220. 10.1016/j.bbabio.2004.07.011View ArticlePubMedGoogle Scholar
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blöcker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ, International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature 2001, 409: 860–921. 10.1038/35057062View ArticlePubMedGoogle Scholar
- Lavner Y, Kotlar D: Codon bias as a factor in regulating expression via translation rate in the human genome. Gene 2005, 345: 127–138. 10.1016/j.gene.2004.11.035View ArticlePubMedGoogle Scholar
- Kotlar D, Lavner Y: The action of selection on codon bias in the human genome is related to frequency, complexity, and chronology of amino acids. BMC Genomics 2006, 7: 67. 10.1186/1471-2164-7-67PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.