Exploration of multivariate analysis in microbial coding sequence modeling
© Mehmood et al.; licensee BioMed Central Ltd. 2012
Received: 27 February 2012
Accepted: 24 April 2012
Published: 14 May 2012
Gene finding is a complicated procedure that encapsulates algorithms for coding sequence modeling, identification of promoter regions, issues concerning overlapping genes and more. In the present study we focus on coding sequence modeling algorithms; that is, algorithms for identification and prediction of the actual coding sequences from genomic DNA. In this respect, we promote a novel multivariate method known as Canonical Powered Partial Least Squares (CPPLS) as an alternative to the commonly used Interpolated Markov model (IMM). Comparisons between the methods were performed on DNA, codon and protein sequences with highly conserved genes taken from several species with different genomic properties.
The multivariate CPPLS approach classified coding sequence substantially better than the commonly used IMM on the same set of sequences. We also found that the use of CPPLS with codon representation gave significantly better classification results than both IMM with protein (p < 0.001) and with DNA (p < 0.001). Further, although the mean performance was similar, the variation of CPPLS performance on codon representation was significantly smaller than for IMM (p < 0.001).
The performance of coding sequence modeling can be substantially improved by using an algorithm based on the multivariate CPPLS method applied to codon or DNA frequencies.
For each sequenced genome, the basic step of annotation is the prediction of genes. In prokaryotes, an average of over 80% of the genome consists of genes which are mostly protein coding, meaning that correct identification of protein coding genes is a key aim in computational biology. A complicating factor is that a fraction of microbial genomes consist of degenerated genes with no remaining functionality. A gene finder must therefore be a rather complex ’engine’ capable of distinguishing real protein-coding genes from DNA sequence regions consisting of degenerated genes, non-coding regions and more. To map genes, gene finders typically identify a set of gene-candidates commonly referred to as open reading frames (ORFs). The number of ORFs found by gene finders is typically large compared to the true number of genes. To reduce the number of ORFs and minimize the false predictions of real protein-coding genes, a gene finder must take into account several genomic properties like the existence of upstream regulatory sequences (ribosomal binding sites, promoter regions, etc.), degree and type of overlap between open reading frames, as well as the content of the coding genes themselves. Considering the above mentioned properties, a gene finding procedure can be sketched as follows: 1) identify all possible ORFs in the genome, 2) score all ORFs by various criteria, e.g. their length, their base composition, their upstream sequence, their overlap with other ORFs, etc. 3) classify ORFs as coding genes or non-coding regions based on the scores achieved in the previous steps.
Although the performance of prokaryotic gene finders is relatively good compared to eukaryotic gene finders[3, 4] there is room for improvement. Prokaryotic gene finders have a tendency to be biased towards identifying false positive ORFs. Short genes are difficult to identify correctly, and genes in GC rich genomes are challenging to predict accurately[6–8]. It is therefore important that proper algorithms for coding sequence modeling are implemented in gene finders. Algorithms used by gene finders should have the ability to extract sequence parameters from coding sequence modeling of putative genes (often referred to as training), and then classify new genes as ORFs based on similarity to the estimated sequence parameters. Several popular gene finders use models based on some sort of Markov chain methodology to identify ORFs[10–15]. Markov chain based models are ”trained” on a set of sequences (typically nucleotide, protein or codon frequencies) and use the statistical parameters extracted from this training to classify new sequences. The training procedure in Glimmer3, which is a Markov chain-type model, identifies long ORFs from DNA sequences which are used to build the Interpolated Context Model. The Interpolated Context Model (IMM) is then used to classify ORFs in DNA sequences having similar characteristics to the training data sequences. This means that the classification power of gene finders based on training relies heavily on the properties of the sequence data used. Thus, for gene finders, it is important that the sequence data used for training has as many general characteristics of genes as possible, which emphasizes the relevance of procedures that facilitates sequence data for accurate gene prediction. To obtain sequence data that may have such characteristics we turn to pangenomics. The re-sequencing of multiple strains within the same species or phylotype has resulted in the study of microbial pangenomes[17–21]. A pangenome is the collection of genes found in all strains within a population. By considering the set of highly conserved genes within a pangenome, we are close to obtaining a data set consisting of “true” genes since these sequences are highly conserved across many strains and are therefore considerably more reliable than data sets based on genes from one genome sequence only. Thus, we argue that data sets consisting of genes obtained from pangenomic inspired analyses may be an adequate starting point for a general testing and comparison of gene finders. Indeed, we use such sequence data to compare the capabilities of a multivariate coding sequence modeling algorithm using different methodology to that of the Markov chain based coding sequence modeling algorithms. Although multivariate methods (e.g.[22, 23]) are extensively applied in other scientific fields only one such method known to the authors has been suggested as a gene finding algorithm. Data sets used for gene finding typically have a large number of variables p (usually frequency counts of oligonucleotide like codons) in comparison to the number of ORFs n. As a consequence we have to deal with the unbalanced p > n situation, making it hard to classify ORFs since unique estimates cannot be found. Multivariate tools like Partial Least Square (PLS) regression are widely used to address unbalanced p > n problems. A recent advancement to the PLS regression scheme combines a novel data compression method, canonical correlation analysis (CCA), to additionally estimate latent variables enhancing classification in regression type problems even further. This method has been termed Canonical Powered Partial Least Squares (CPPLS) and we explore the performance on the modeling of coding genes.
Gene modeling data
The genomic data which was used to train the coding sequence modeling algorithms was divided into two groups. One group, termed ‘Positives’, contained ORFs considered to be real genes. The other group, termed ‘Negatives’, consisted of ORFs known to be non-coding, i.e. sequences recognized as non-genes. We only considered protein coding genes in this study.
An overview of the species
Number of genomes
Algorithms involved in coding sequence modeling must separate sequences that are genes and sequences that are not genes. Sequences that are not genes are designated Negatives. The set of Negatives will further enable classifying sequences as coding or non-coding genes. Negatives are considerably harder to identify than Positives since prokaryotic genomes are densely covered with genes. Even if a sequence is not among our HCOs it may very well be a coding gene, or at least part of a coding gene. However, the reading frame is an indispensable concept with respect to coding sequences, as elaborated by, and due to different selection pressure in-frame and out-of-frame sequences are evaluated differently and form completely separate clusters. Consequently, we consider the out-of-frame interior from the set of Positives as Negatives in the current study. This implies that no Positive has a complete overlap with another Positive. It is, however, typically accepted that functional genes in prokaryotes can overlap over short stretches. Hence, a small fraction of our Negatives may actually be part of a gene, making Negatives difficult to classify correctly. There are always 5 out-of-frame reading frames, and all are considered as Negatives, i.e. for each Positive we have 5 Negatives. Sequences designated as Negatives will hence not have a proper start and stop codon, but are likely to contain spurious stop codons since they are out-of-frame. In order to use this approach, we therefore eliminated the first start- and all stop-codons from every sequence labeled as either Negative or Positive.
Data splitting and cross-validation
For each species, the sets containing Positives and Negatives were randomly divided into 10 equally sized subsets. One of these subsets was used as test data while the other 9 subsets were used as training data. The procedure was repeated in a 10-fold cross-validation.
ORF sequence representation
Genes can be represented as DNA sequences, codon sequences or protein (amino acid) sequences. We describe all representations below with respect to coding sequence modeling.
The DNA alphabet consists of 4 symbols; but the reading frame concept must also be taken into consideration. Hence, the bases we observe in codon positions 1, 2 and 3 must be considered separately, otherwise it is impossible to distinguish in-frame from out-of-frame sequences. Markov chain models therefore need three separate sets of transition probabilities, each set corresponding to the target symbol in reading frame 1, 2 or 3. The pretext, i.e. the subsequence a Markov chain model is conditioned upon, consists of all preceding k symbols regardless of which reading frame is considered. A Markov chain model will therefore traverse a DNA sequence, nucleotide by nucleotide, constantly consulting transition probabilities from all reading frames. Such is the case for GeneMark and GLIMMER. From this perspective, the DNA alphabet of coding sequences has 4 ∗ 3 = 12 and not 4 symbols.
Each protein coding gene may also be represented by its codon alphabet. The codons consist of three consecutive nucleotides and code for amino acids, thereby giving 64 possible combinations. Ignoring the 3 exclusive stop codons, 61 symbols are free to code for amino acids. Since there are only 20 different standard amino acids, the codon alphabet is redundant. In other words, some codons code for the same amino acid. Hence, some codons are synonymous while others are non-synonymous. In fact, the redundancy of the codon alphabet allows organisms and genes to prefer specific codons coding for specific amino acids. This is typically known as codon bias. Although the codon alphabet, with its 61 symbols, provides more resolution than the DNA and protein alphabets, the added information can be a computational challenge.
Due to the redundancy of the codon alphabet gene comparisons may often be more successful using protein sequences. Since different codons can code for the same amino acid, DNA sequences representing homologue genes may be very different in terms of base composition and therefore hard to detect using DNA based search engines. In such cases, using protein sequences instead of DNA sequences may give better results since there is no redundancy. Protein sequences are expected to be highly conserved by purifying selection, in contrast to the more variable DNA sequences.
Classification of coding sequence
The methods used to classify genes were Interpolated Markov model (IMM) and Canonical Powered PLS (CPPLS). Both models need to be trained and from the training data set of n sequences we create a n × 1 numeric response vector y containing the value 1 if the respective sequence is from the Positive set and -1 if the respective sequence is from the Negative set.
Interpolated Markov models (IMM)
Markov chain models are widely used to detect patterns in biological sequences. Unfortunately, these models are hampered by the necessity to find the appropriate order of the Markov chain. A higher order Markov chain model has more parameters and therefore less bias since it is capable of describing more accurately the real probabilities behind the observed sequences. However, for a fixed size data set the information per parameter is less, resulting in estimators with increased variance. Thus, the improvement obtained due to less bias may be lost to the increased variance. A fifth order Markov chain model is employed by GeneMark, while the gene finding algorithm in Glimmer is based on the interpolated Markov model (IMM). The latter model (IMM) estimates several chains with different orders, of which the separate scores are subsequently combined into one, making it a more general approach than the prior 5 th order model. Since we are comparing coding sequence modeling algorithms we use the IMM approach used by the Glimmer software. This means that the final probability of a symbol is a linear combination of several Markov chain models from order k = 0 up to some upper limit k = K, where the Markov chain transition probabilities from various orders are weighted based on the size and information content of the training data. Some additional effort is required to estimate these weights since there is no closed form solution for the maximum of the likelihood function. The Expectation Maximization (EM) algorithm is applied iteratively to find local optimum solutions which are consequently applied to optimize the weights used in the linear interpolation. From the training data two interpolated Markov chain models are fitted, one for Positives (+1) and the other for Negatives (-1). Thus, for both Positives and Negatives we need to estimate the transition probability matrices along with the weights used in the interpolation procedure. Then, for each sequence from the test data the posterior log-probability scores for the Positive and Negative models are computed using the estimated transition probability matrices and weights. Finally, each test set of sequences is assigned to the class (+1 or -1) depending on the log-probability score. In an approach like this, the upper model order K must be restricted due to space and computation time limitations. For the codon alphabet, having 61 symbols, even a second order model (K = 2) includes 613 = 226981 transition probabilities, and is therefore computationally very slow. Also, a training set of considerable size is required to estimate all probabilities with reasonable variance. The addition of pseudo counts is considered useful method to stabilize the estimates of a Markov chain model. We have chosen to use this as well, but in a very careful way. If we have m observations (transitions/initiations) in our data set, we add pseudo counts as well, all having probabilities given by a 0-order Markov chain model for the Positives or Negatives, respectively.
Canonical Powered PLS (CPPLS)
β are p regression coefficients relating every word frequency to the class status (+1 or -1). This results in a ’large p and small n’ situation, where ordinary least squares type methods provide poor solutions. The PLS method can estimate the regression coefficients for such a case using an iterative procedure described in. There are many algorithms in the PLS-family, and for classification purposes we use the CPPLS method. Thus, from the training data we estimate the regression vector β describing the contrast between Positives and Negatives. For a given test sequence, the corresponding word frequency 1 × p vector x is computed. Based on the CPPLS estimated regression coefficients a score is predicted by classified as +1 or -1, that is as Negative or Positive.
The number of probabilities to be estimated in an IMM
k = 0
k = 1
k = 2
k = 3
k = 4
k = 5
Mixed effect model
The main objective of the study is to make comparisons of methods (CPPLS vs. IMM) and sequence representations (codon vs. protein vs. DNA) on the ability to classify coding sequences. The study has been conducted on genomes from many different species, and in order to present all results in a single analysis, we have adopted an analysis of variance (ANOVA) approach. We were primarily interested in how the choice of method and sequence representation affected the classification performance (outcome), and the (random) variability in results between species should be considered as random ’noise’ in the analysis. This was accomplished by the use of a mixed-effect ANOVA model, where the fixed effects on performance are the focus of our attention (method and sequence representation) and a random effect of species is included to deal with variation between species.
The performance is defined as the percentage of correctly classified ORFs in a test data set using 10-fold cross validation. ANOVA analyses assume constant performance variance at different levels of the fixed effects, which was originally not the case in our data set. To stabilize the variance, the original performance y (percentages) was transformed to z as.
where the outcome zi,j,k is the observed transformed performance, μ is the overall expected transformed performance level, α i is the fixed effect of method i = 1, 2, β j is the fixed effect of sequence representation j = 1,2,3, (αβ)i,j is the interaction term combining method i and sequence representation j, s k is the random effect of species k = 1, …, 12 and ei,j,k is the residual variation. As part of the model assumptions in a standard ANOVA we used normal distributed error terms and.
Results and discussion
Even if the RefSeq database is curated, there may still be errors. In order to eliminate uncertain sequences we only considered those which were conserved across all genomes within each species. Additional file1: Figure S1 shows how the number of gene clusters grows by the choice of threshold t, which represents the similarity between sequences inside a cluster. In our analysis we have chosen to use t=0.3, meaning that clusters will contain sequences that are roughly 100% (1 − t) = 70% similar. For each such cluster having members from all genomes, we allocate the medoide sequence to the set of Positives for the corresponding species. As seen from Additional file1: Figure S1, this results in a rather large number of Positives for all species and we are assured that these sequences are coding genes. So instead of taking all HCOs at t = 0.3, if a species has more than 400 HCO’s, we sampled 400 sequences at random as Positives. We have chosen to use as Negatives sequences that constitute the out-of-frame interior of the Positives. The reason for this is actually straightforward; coding genes predominantly cover prokaryotic genomes therefore the intergenic regions are few and small. For instance, the RefSeq annotated genes cover, on average, more than 92% of the genomes in this study. On the other hand, annotations of genes with large overlaps are few in number; therefore we assume that if there is some region where we know there is a coding gene, there will be a small chance that any other coding gene is present in the same region. Thus, we presume that sequences from the out-of-frame interior of the Positives are the types of sequences that have the same base compositional properties as the majority of non-coding ORFs (i.e. Negatives). We also eliminated the first codon (start) as well as all stop-codons from both Positives and Negatives, in order to make the classifications based on the content and not the endpoints.
Coding sequence recognition
Analysis of variance for a mixed effect design in coding sequence modeling
Sum of squares
p < 0.001
p < 0.001
p < 0.001
p < 0.001
Although CPPLS based on codon frequencies, performs extremely well for ORF classification there are a few Positives missed. In the genome of Sulfolobus islandicus we miss an iron-sulfur binding domain protein and some hypothetical proteins. In Pseudomonas putida we fail to detect the genes annotated as “RND family efflux transporter MFP subunit”, “copper resistance B”, as well as some hypothetical proteins. In Mycobacterium tuberculosis we miss some hypothetical proteins and a “transmembrane serine” protein. For Escherichia coli we fail to classify an “intimin adherence” protein as Positive. This is a protein with no clear function defined also found in some Shigella and Citrobacter species.
We note that these genes are all involved in pathogenecity, e.g. the intimin gene is usually found on pathogenicity islands known collectively as LEE’s. Pathogenicity is a trait prone to be horizontally transferred[37, 38]. The fact that these genes are quite different in codon composition from all other HCO’s in their respective populations may indeed be taken as an indication of recent horizontal gene transfer. This illustrates another potential use of coding sequence modeling besides gene finding. When a highly conserved ORF is not recognized as such, it is an indicator of ’foreign’ DNA. The recognition of horizontally transferred genes, which are often linked to virulence factors and antibiotic resistance[39, 40], can be aided by the capability of coding sequence modeling. For instance, it is known that the GC content of the third codon position is highly correlated with genomic GC content. Since genomic GC content is associated with the environment of the bacteria[42, 43], the codon frequencies of horizontally transferred DNA may be very different to that of the host.
Results of comprehensive comparisons in coding sequence modeling on multiple data sets show that the CPPLS approach provides superior performance compared to the IMM. Furthermore, codon representations were found to be superior in classifying ORFs compared to DNA and protein representations for the CPPLS method. We therefore conclude that a multivariate approach like CPPLS should be more utilized in coding sequence modeling, as well as in pattern recognition problems where sequences are to be classified by their content, like for instance, in the detection of horizontally transferred DNA.
TM and LS initiated the project and the ideas. All authors have been involved in the later development of the approach and the final algorithm. TM has done the programming, with some assistance from SS. TM, ABK, JB and LS has drafted the manuscript, with inputs from all other authors. All authors have read and approved the final manuscript.
Tahir Mehmood’s scholarship has been fully financed by the Higher Education Commission of Pakistan.
- Ahnert S, Fink T, Zinovyev A: How much non-coding DNA do eukaryotes require? J Theor Biol 2008, 252(4):587–592. 10.1016/j.jtbi.2008.02.005View ArticlePubMedGoogle Scholar
- Toh H, Weiss B, Perkin S, Yamashita A, Oshima K, Hattori M, Aksoy S: Massive genome erosion and functional adaptations provide insights into the symbiotic lifestyle of Sodalis glossinidius in the tsetse host. Genome Res 2006, 16(2):149–156.PubMed CentralView ArticlePubMedGoogle Scholar
- Do J, Choi D: Computational approaches to gene prediction. J Microbiol Seoul 2006, 44(2):137.Google Scholar
- Warren A, Archuleta J, Feng W, Setubal J: Missing genes in the annotation of prokaryotic genomes. BMC Bioinf 2010, 11: 131. 10.1186/1471-2105-11-131View ArticleGoogle Scholar
- Angelova M, Kalajdziski S, Kocarev L: Computational Methods for Gene Finding in Prokaryotes. Web Proceedings, ISSN 2010, 1: 11–20.Google Scholar
- Delcher A, Bratke K, Powers E, Salzberg S: Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 2007, 23(6):673–679. 10.1093/bioinformatics/btm009PubMed CentralView ArticlePubMedGoogle Scholar
- Larsen T, Krogh A: EasyGene–a prokaryotic gene finder that ranks ORFs by statistical significance. BMC bioinformatics 2003, 4: 21. 10.1186/1471-2105-4-21PubMed CentralView ArticlePubMedGoogle Scholar
- Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ: Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinf 2010, 11: 119. 10.1186/1471-2105-11-119View ArticleGoogle Scholar
- Ripley B: Pattern recognition and neural networks. United Kingdom, Cambridge Univ Pr; 2008.Google Scholar
- Borodovsky M, McIninch J: Recognition of genes in DNA sequence with ambiguities. Biosystems 1993, 30(1–3):161–171. 10.1016/0303-2647(93)90068-NView ArticlePubMedGoogle Scholar
- Borodovsky M, McIninch J: GENMARK: parallel gene recognition for both DNA strands. Comput Chem 1993, 17(2):123–133.View ArticleGoogle Scholar
- Lukashin AV, Borodovsky M: GeneMark.hmm: new solutions for gene finding. Nucl Acids Res 1998, 26: 1107–1115. 10.1093/nar/26.4.1107PubMed CentralView ArticlePubMedGoogle Scholar
- Salzberg S, Delcher A, Kasif S, White O: Microbial gene identification using interpolated Markov models. Nucl Acids Res 1998, 26(2):544. 10.1093/nar/26.2.544PubMed CentralView ArticlePubMedGoogle Scholar
- Bocs S, Cruveiller S, Vallenet D, Nuel G, Médigue C: AMIGene: annotation of microbial genes. Nucl Acids Res 2003, 31(13):3723. 10.1093/nar/gkg590PubMed CentralView ArticlePubMedGoogle Scholar
- Larsen TS, Krogh A: EasyGene–a prokaryotic gene finder that ranks ORFs by statistical significance. BMC Bioinf 2003, 4: 21. 10.1186/1471-2105-4-21View ArticleGoogle Scholar
- Azad R, Borodovsky M: Probabilistic methods of identifying genes in prokaryotic genomes: connections to the HMM theory. Briefings Bioinf 2004, 5(2):118–130. 10.1093/bib/5.2.118View ArticleGoogle Scholar
- Tettelin H, Masignani V, Cieslewicz M, Donati C, Medini D, Ward N, Angiuoli S, Crabtree J, Jones A, Durkin A, DeBoy R, Davidsen T, Mora M, Scarselli M, Ros I, Peterson J, Hauser C, Sundaram J, Nelson W, Madupu R, Brinkac L, Dodson R, Rosovitz M, Sullivan S, Daugherty S, Haft D, Selengut J, Gwinn M, Zhou L, Zafar N, Khouri H, Radune D, Dimitrov G, Watkins K, O’Connor K, Smith S, Utterback T, White O, Rubens C, Grandi G, Madoff L, Kasper D, Telford J, Wessels M, Rappuoli R, Fraser C: Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implications for the microbial ‘pan-genome’. Proc Nat Acad Sci USA 2005, 102(45):16530.Google Scholar
- Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R: The microbial pan-genome. Curr Opin Genet Dev 2005, 15: 589–594. 10.1016/j.gde.2005.09.006View ArticlePubMedGoogle Scholar
- Chen SL, Hung CS, Xu J, Reigstad CS, Magrini V, Sabo A, Blasiar D, Bieri T, Meyer RR, Ozersky P, Armstrong JR, Fulton RS, Latreille JP, Spieth J, Hooton TM, Mardis ER, Hultgren SJ, Gordon JI: Identification of genes subject to positive selection in uropathogenic strains of Escherichia coli: a comparative genomics approach. Proc Natl Acad Sci USA 2006, 103: 5977–5982. 10.1073/pnas.0600938103PubMed CentralView ArticlePubMedGoogle Scholar
- Willenbrock H, Hallin PF, Wassenaar TM, Ussery DW: Characterization of probiotic Escherichia coli isolates with a novel pan-genome microarray. Genome Biol 2007, 8: R267. 10.1186/gb-2007-8-12-r267PubMed CentralView ArticlePubMedGoogle Scholar
- Reinhardt JA, Baltrus DA, Nishimura MT, Jeck WR, Jones CD, Dangl JL: De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas syringae pv. oryzae. Genome Res 2009, 19: 294–305.PubMed CentralView ArticlePubMedGoogle Scholar
- Indahl U, Liland K, Næs T: Canonical partial least squares a unified PLS approach to classification and regression problems. J Chemom 2009, 23(9):495–504. 10.1002/cem.1243View ArticleGoogle Scholar
- Trygg J, Wold S: Orthogonal projections to latent structures (O-PLS). J Chemom 2002, 16(3):119–128. 10.1002/cem.695View ArticleGoogle Scholar
- Zhu H, Hu GQ, Yang YF, Wang J, She ZS: MED: a new non-supervised gene prediction algorithm for bacterial and archaeal genomes. BMC Bioinf 2007, 8: 97. 10.1186/1471-2105-8-97View ArticleGoogle Scholar
- Martens H, Næs T: Multivariate Calibration. Wiley, United Kingdom; 1989.Google Scholar
- Pruitt K, Tatusova T, Klimke W, Maglott D: NCBI Reference Sequences: current status, policy and new initiatives. Nucl Acids Res 2009, 37(suppl 1):D32-D36.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. J Comput Biol 2000, 7(1–2):203–214. 10.1089/10665270050081478View ArticlePubMedGoogle Scholar
- Ouyang Z, Zhu H, Wang J, She Z: Multivariate entropy distance method for prokaryotic gene identification. J Bioinf Comput Biol 2004, 2(2):353–374. 10.1142/S0219720004000624View ArticleGoogle Scholar
- Chih-Hsien C, Chung-Han Y, Hsien-Tai C, Lung L: Reconstructing genome trees of prokaryotes using overlapping genes. BMC Bioinf 2010, 11: 102. 10.1186/1471-2105-11-102View ArticleGoogle Scholar
- Sharp P, Li W: The codon adaptation index-a measure of directional synonymous codon usage bias, and its potential applications. Nucl Acids Res 1987, 15(3):1281. 10.1093/nar/15.3.1281PubMed CentralView ArticlePubMedGoogle Scholar
- Bählmann P, Wyner A: Variable length Markov chains. Ann Stat 1999, 27(2):480–513. 10.1214/aos/1018031204View ArticleGoogle Scholar
- Ohler U, Harbeck S, Niemann H: Interpolated markov chains for eukaryotic promoter recognition. Bioinformatics 1999, 15(5):362. 10.1093/bioinformatics/15.5.362View ArticlePubMedGoogle Scholar
- Yada T, Totoki Y, Takagi T, Nakai K: A novel bacterial gene-finding system with improved accuracy in locating start codons. DNA Res 2001, 8(3):97–106. 10.1093/dnares/8.3.97View ArticlePubMedGoogle Scholar
- Hastie T, Tibshirani R, Friedman J: The elements of statistical learning. Springer Series in Statistics 2001, 1: 37–38.Google Scholar
- McGill R, Tukey J, Larsen W: Variations of box plots. Am Statistician 1978, 32: 12–16.Google Scholar
- Schmidt M: LEEways: tales of EPEC, ATEC and EHEC. Cell Microbiol 2010, 12(11):1544–1552. 10.1111/j.1462-5822.2010.01518.xView ArticlePubMedGoogle Scholar
- Lesic B, Carniel E: Horizontal transfer of the high pathogenicity island of Yersinia pseudotuberculosis. J Bacteriol 2005, 187(10):3352. 10.1128/JB.187.10.3352-3358.2005PubMed CentralView ArticlePubMedGoogle Scholar
- Faguy D, Doolittle W: Horizontal transfer of catalase peroxidase genes between archaea and pathogenic bacteria. Trends Genet: TIG 2000, 16(5):196. 10.1016/S0168-9525(00)02007-2View ArticlePubMedGoogle Scholar
- Pallen M, Wren B: Bacterial pathogenomics. Nature 2007, 449(7164):835–842. 10.1038/nature06248View ArticlePubMedGoogle Scholar
- Fournier P, Drancourt M, Raoult D: Bacterial genome sequencing and its use in infectious diseases. Lancet Infectious Diseases 2007, 7(11):711–723. 10.1016/S1473-3099(07)70260-8View ArticlePubMedGoogle Scholar
- Wan X, Xu D, Kleinhofs A, Zhou J: Quantitative relationship between synonymous codon usage bias and GC composition across unicellular genomes. BMC Evolutionary Biol 2004, 4: 19. 10.1186/1471-2148-4-19View ArticleGoogle Scholar
- Chen L, Zhang C: Seven GC-rich microbial genomes adopt similar codon usage patterns regardless of their phylogenetic lineages. Biochem Biophys Res Commun 2003, 306: 310–317. 10.1016/S0006-291X(03)00973-2View ArticlePubMedGoogle Scholar
- Foerstner K, Von Mering C, Hooper S, Bork P: Environments shape the nucleotide composition of genomes. EMBO R 2005, 6(12):1208–1213. 10.1038/sj.embor.7400538View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.