Exploration of multivariate analysis in microbial coding sequence modeling
© Mehmood et al.; licensee BioMed Central Ltd. 2012
Received: 27 February 2012
Accepted: 24 April 2012
Published: 14 May 2012
Skip to main content
© Mehmood et al.; licensee BioMed Central Ltd. 2012
Received: 27 February 2012
Accepted: 24 April 2012
Published: 14 May 2012
Gene finding is a complicated procedure that encapsulates algorithms for coding sequence modeling, identification of promoter regions, issues concerning overlapping genes and more. In the present study we focus on coding sequence modeling algorithms; that is, algorithms for identification and prediction of the actual coding sequences from genomic DNA. In this respect, we promote a novel multivariate method known as Canonical Powered Partial Least Squares (CPPLS) as an alternative to the commonly used Interpolated Markov model (IMM). Comparisons between the methods were performed on DNA, codon and protein sequences with highly conserved genes taken from several species with different genomic properties.
The multivariate CPPLS approach classified coding sequence substantially better than the commonly used IMM on the same set of sequences. We also found that the use of CPPLS with codon representation gave significantly better classification results than both IMM with protein (p < 0.001) and with DNA (p < 0.001). Further, although the mean performance was similar, the variation of CPPLS performance on codon representation was significantly smaller than for IMM (p < 0.001).
The performance of coding sequence modeling can be substantially improved by using an algorithm based on the multivariate CPPLS method applied to codon or DNA frequencies.
For each sequenced genome, the basic step of annotation is the prediction of genes. In prokaryotes, an average of over 80% of the genome consists of genes which are mostly protein coding , meaning that correct identification of protein coding genes is a key aim in computational biology. A complicating factor is that a fraction of microbial genomes consist of degenerated genes with no remaining functionality . A gene finder must therefore be a rather complex ’engine’ capable of distinguishing real protein-coding genes from DNA sequence regions consisting of degenerated genes, non-coding regions and more. To map genes, gene finders typically identify a set of gene-candidates commonly referred to as open reading frames (ORFs). The number of ORFs found by gene finders is typically large compared to the true number of genes. To reduce the number of ORFs and minimize the false predictions of real protein-coding genes, a gene finder must take into account several genomic properties like the existence of upstream regulatory sequences (ribosomal binding sites, promoter regions, etc.), degree and type of overlap between open reading frames, as well as the content of the coding genes themselves. Considering the above mentioned properties, a gene finding procedure can be sketched as follows: 1) identify all possible ORFs in the genome, 2) score all ORFs by various criteria, e.g. their length, their base composition, their upstream sequence, their overlap with other ORFs, etc. 3) classify ORFs as coding genes or non-coding regions based on the scores achieved in the previous steps.
Although the performance of prokaryotic gene finders is relatively good compared to eukaryotic gene finders [3, 4] there is room for improvement. Prokaryotic gene finders have a tendency to be biased towards identifying false positive ORFs . Short genes are difficult to identify correctly , and genes in GC rich genomes are challenging to predict accurately [6–8]. It is therefore important that proper algorithms for coding sequence modeling are implemented in gene finders. Algorithms used by gene finders should have the ability to extract sequence parameters from coding sequence modeling of putative genes (often referred to as training), and then classify new genes as ORFs based on similarity to the estimated sequence parameters . Several popular gene finders use models based on some sort of Markov chain methodology to identify ORFs [10–15]. Markov chain based models are ”trained” on a set of sequences (typically nucleotide, protein or codon frequencies) and use the statistical parameters extracted from this training to classify new sequences . The training procedure in Glimmer3 , which is a Markov chain-type model, identifies long ORFs from DNA sequences which are used to build the Interpolated Context Model. The Interpolated Context Model (IMM) is then used to classify ORFs in DNA sequences having similar characteristics to the training data sequences. This means that the classification power of gene finders based on training relies heavily on the properties of the sequence data used. Thus, for gene finders, it is important that the sequence data used for training has as many general characteristics of genes as possible, which emphasizes the relevance of procedures that facilitates sequence data for accurate gene prediction. To obtain sequence data that may have such characteristics we turn to pangenomics . The re-sequencing of multiple strains within the same species or phylotype has resulted in the study of microbial pangenomes [17–21]. A pangenome is the collection of genes found in all strains within a population. By considering the set of highly conserved genes within a pangenome, we are close to obtaining a data set consisting of “true” genes since these sequences are highly conserved across many strains and are therefore considerably more reliable than data sets based on genes from one genome sequence only. Thus, we argue that data sets consisting of genes obtained from pangenomic inspired analyses may be an adequate starting point for a general testing and comparison of gene finders. Indeed, we use such sequence data to compare the capabilities of a multivariate coding sequence modeling algorithm using different methodology to that of the Markov chain based coding sequence modeling algorithms. Although multivariate methods (e.g. [22, 23]) are extensively applied in other scientific fields only one such method known to the authors has been suggested as a gene finding algorithm . Data sets used for gene finding typically have a large number of variables p (usually frequency counts of oligonucleotide like codons) in comparison to the number of ORFs n. As a consequence we have to deal with the unbalanced p > n situation, making it hard to classify ORFs since unique estimates cannot be found. Multivariate tools like Partial Least Square (PLS) regression are widely used to address unbalanced p > n problems . A recent advancement to the PLS regression scheme combines a novel data compression method, canonical correlation analysis (CCA), to additionally estimate latent variables enhancing classification in regression type problems even further. This method has been termed Canonical Powered Partial Least Squares (CPPLS)  and we explore the performance on the modeling of coding genes.
The genomic data which was used to train the coding sequence modeling algorithms was divided into two groups. One group, termed ‘Positives’, contained ORFs considered to be real genes. The other group, termed ‘Negatives’, consisted of ORFs known to be non-coding, i.e. sequences recognized as non-genes. We only considered protein coding genes in this study.
An overview of the species
Number of genomes
Algorithms involved in coding sequence modeling must separate sequences that are genes and sequences that are not genes. Sequences that are not genes are designated Negatives. The set of Negatives will further enable classifying sequences as coding or non-coding genes. Negatives are considerably harder to identify than Positives since prokaryotic genomes are densely covered with genes. Even if a sequence is not among our HCOs it may very well be a coding gene, or at least part of a coding gene. However, the reading frame is an indispensable concept with respect to coding sequences, as elaborated by , and due to different selection pressure in-frame and out-of-frame sequences are evaluated differently and form completely separate clusters . Consequently, we consider the out-of-frame interior from the set of Positives as Negatives in the current study. This implies that no Positive has a complete overlap with another Positive. It is, however, typically accepted that functional genes in prokaryotes can overlap over short stretches . Hence, a small fraction of our Negatives may actually be part of a gene, making Negatives difficult to classify correctly. There are always 5 out-of-frame reading frames, and all are considered as Negatives, i.e. for each Positive we have 5 Negatives. Sequences designated as Negatives will hence not have a proper start and stop codon, but are likely to contain spurious stop codons since they are out-of-frame. In order to use this approach, we therefore eliminated the first start- and all stop-codons from every sequence labeled as either Negative or Positive.
For each species, the sets containing Positives and Negatives were randomly divided into 10 equally sized subsets. One of these subsets was used as test data while the other 9 subsets were used as training data. The procedure was repeated in a 10-fold cross-validation.
Genes can be represented as DNA sequences, codon sequences or protein (amino acid) sequences. We describe all representations below with respect to coding sequence modeling.
The DNA alphabet consists of 4 symbols; but the reading frame concept must also be taken into consideration. Hence, the bases we observe in codon positions 1, 2 and 3 must be considered separately, otherwise it is impossible to distinguish in-frame from out-of-frame sequences. Markov chain models therefore need three separate sets of transition probabilities, each set corresponding to the target symbol in reading frame 1, 2 or 3. The pretext, i.e. the subsequence a Markov chain model is conditioned upon, consists of all preceding k symbols regardless of which reading frame is considered. A Markov chain model will therefore traverse a DNA sequence, nucleotide by nucleotide, constantly consulting transition probabilities from all reading frames. Such is the case for GeneMark  and GLIMMER . From this perspective, the DNA alphabet of coding sequences has 4 ∗ 3 = 12 and not 4 symbols.
Each protein coding gene may also be represented by its codon alphabet. The codons consist of three consecutive nucleotides and code for amino acids, thereby giving 64 possible combinations. Ignoring the 3 exclusive stop codons, 61 symbols are free to code for amino acids. Since there are only 20 different standard amino acids, the codon alphabet is redundant. In other words, some codons code for the same amino acid. Hence, some codons are synonymous while others are non-synonymous. In fact, the redundancy of the codon alphabet allows organisms and genes to prefer specific codons coding for specific amino acids. This is typically known as codon bias . Although the codon alphabet, with its 61 symbols, provides more resolution than the DNA and protein alphabets, the added information can be a computational challenge.
Due to the redundancy of the codon alphabet gene comparisons may often be more successful using protein sequences. Since different codons can code for the same amino acid, DNA sequences representing homologue genes may be very different in terms of base composition and therefore hard to detect using DNA based search engines. In such cases, using protein sequences instead of DNA sequences may give better results since there is no redundancy. Protein sequences are expected to be highly conserved by purifying selection, in contrast to the more variable DNA sequences .
The methods used to classify genes were Interpolated Markov model (IMM)  and Canonical Powered PLS (CPPLS) . Both models need to be trained and from the training data set of n sequences we create a n × 1 numeric response vector y containing the value 1 if the respective sequence is from the Positive set and -1 if the respective sequence is from the Negative set.
Markov chain models are widely used to detect patterns in biological sequences. Unfortunately, these models are hampered by the necessity to find the appropriate order of the Markov chain. A higher order Markov chain model has more parameters and therefore less bias since it is capable of describing more accurately the real probabilities behind the observed sequences. However, for a fixed size data set the information per parameter is less, resulting in estimators with increased variance . Thus, the improvement obtained due to less bias may be lost to the increased variance. A fifth order Markov chain model is employed by GeneMark, while the gene finding algorithm in Glimmer is based on the interpolated Markov model (IMM). The latter model (IMM) estimates several chains with different orders, of which the separate scores are subsequently combined into one, making it a more general approach than the prior 5 th order model. Since we are comparing coding sequence modeling algorithms we use the IMM approach used by the Glimmer software . This means that the final probability of a symbol is a linear combination of several Markov chain models from order k = 0 up to some upper limit k = K, where the Markov chain transition probabilities from various orders are weighted based on the size and information content of the training data. Some additional effort is required to estimate these weights since there is no closed form solution for the maximum of the likelihood function. The Expectation Maximization (EM) algorithm  is applied iteratively to find local optimum solutions which are consequently applied to optimize the weights used in the linear interpolation. From the training data two interpolated Markov chain models are fitted, one for Positives (+1) and the other for Negatives (-1). Thus, for both Positives and Negatives we need to estimate the transition probability matrices along with the weights used in the interpolation procedure. Then, for each sequence from the test data the posterior log-probability scores for the Positive and Negative models are computed using the estimated transition probability matrices and weights. Finally, each test set of sequences is assigned to the class (+1 or -1) depending on the log-probability score. In an approach like this, the upper model order K must be restricted due to space and computation time limitations. For the codon alphabet, having 61 symbols, even a second order model (K = 2) includes 613 = 226981 transition probabilities, and is therefore computationally very slow. Also, a training set of considerable size is required to estimate all probabilities with reasonable variance. The addition of pseudo counts is considered useful method to stabilize the estimates of a Markov chain model . We have chosen to use this as well, but in a very careful way. If we have m observations (transitions/initiations) in our data set, we add pseudo counts as well, all having probabilities given by a 0-order Markov chain model for the Positives or Negatives, respectively.
β are p regression coefficients relating every word frequency to the class status (+1 or -1). This results in a ’large p and small n’ situation, where ordinary least squares type methods provide poor solutions. The PLS method can estimate the regression coefficients for such a case using an iterative procedure described in . There are many algorithms in the PLS-family, and for classification purposes we use the CPPLS method . Thus, from the training data we estimate the regression vector β describing the contrast between Positives and Negatives. For a given test sequence, the corresponding word frequency 1 × p vector x is computed. Based on the CPPLS estimated regression coefficients a score is predicted by classified as +1 or -1, that is as Negative or Positive .
The number of probabilities to be estimated in an IMM
k = 0
k = 1
k = 2
k = 3
k = 4
k = 5
The main objective of the study is to make comparisons of methods (CPPLS vs. IMM) and sequence representations (codon vs. protein vs. DNA) on the ability to classify coding sequences. The study has been conducted on genomes from many different species, and in order to present all results in a single analysis, we have adopted an analysis of variance (ANOVA) approach. We were primarily interested in how the choice of method and sequence representation affected the classification performance (outcome), and the (random) variability in results between species should be considered as random ’noise’ in the analysis. This was accomplished by the use of a mixed-effect ANOVA model, where the fixed effects on performance are the focus of our attention (method and sequence representation) and a random effect of species is included to deal with variation between species.
The performance is defined as the percentage of correctly classified ORFs in a test data set using 10-fold cross validation. ANOVA analyses assume constant performance variance at different levels of the fixed effects, which was originally not the case in our data set. To stabilize the variance, the original performance y (percentages) was transformed to z as .
where the outcome z i,j,k is the observed transformed performance, μ is the overall expected transformed performance level, α i is the fixed effect of method i = 1, 2, β j is the fixed effect of sequence representation j = 1,2,3, (αβ) i,j is the interaction term combining method i and sequence representation j, s k is the random effect of species k = 1, …, 12 and e i,j,k is the residual variation. As part of the model assumptions in a standard ANOVA we used normal distributed error terms and .
Even if the RefSeq database is curated, there may still be errors. In order to eliminate uncertain sequences we only considered those which were conserved across all genomes within each species. Additional file 1: Figure S1 shows how the number of gene clusters grows by the choice of threshold t, which represents the similarity between sequences inside a cluster. In our analysis we have chosen to use t=0.3, meaning that clusters will contain sequences that are roughly 100% (1 − t) = 70% similar. For each such cluster having members from all genomes, we allocate the medoide sequence to the set of Positives for the corresponding species. As seen from Additional file 1: Figure S1, this results in a rather large number of Positives for all species and we are assured that these sequences are coding genes. So instead of taking all HCOs at t = 0.3, if a species has more than 400 HCO’s, we sampled 400 sequences at random as Positives. We have chosen to use as Negatives sequences that constitute the out-of-frame interior of the Positives. The reason for this is actually straightforward; coding genes predominantly cover prokaryotic genomes therefore the intergenic regions are few and small. For instance, the RefSeq annotated genes cover, on average, more than 92% of the genomes in this study. On the other hand, annotations of genes with large overlaps are few in number; therefore we assume that if there is some region where we know there is a coding gene, there will be a small chance that any other coding gene is present in the same region. Thus, we presume that sequences from the out-of-frame interior of the Positives are the types of sequences that have the same base compositional properties as the majority of non-coding ORFs (i.e. Negatives). We also eliminated the first codon (start) as well as all stop-codons from both Positives and Negatives, in order to make the classifications based on the content and not the endpoints.
Analysis of variance for a mixed effect design in coding sequence modeling
Sum of squares
p < 0.001
p < 0.001
p < 0.001
p < 0.001
Although CPPLS based on codon frequencies, performs extremely well for ORF classification there are a few Positives missed. In the genome of Sulfolobus islandicus we miss an iron-sulfur binding domain protein and some hypothetical proteins. In Pseudomonas putida we fail to detect the genes annotated as “RND family efflux transporter MFP subunit”, “copper resistance B”, as well as some hypothetical proteins. In Mycobacterium tuberculosis we miss some hypothetical proteins and a “transmembrane serine” protein. For Escherichia coli we fail to classify an “intimin adherence” protein as Positive. This is a protein with no clear function defined also found in some Shigella and Citrobacter species.
We note that these genes are all involved in pathogenecity, e.g. the intimin gene is usually found on pathogenicity islands known collectively as LEE’s . Pathogenicity is a trait prone to be horizontally transferred [37, 38]. The fact that these genes are quite different in codon composition from all other HCO’s in their respective populations may indeed be taken as an indication of recent horizontal gene transfer. This illustrates another potential use of coding sequence modeling besides gene finding. When a highly conserved ORF is not recognized as such, it is an indicator of ’foreign’ DNA. The recognition of horizontally transferred genes, which are often linked to virulence factors and antibiotic resistance [39, 40], can be aided by the capability of coding sequence modeling. For instance, it is known that the GC content of the third codon position is highly correlated with genomic GC content . Since genomic GC content is associated with the environment of the bacteria [42, 43], the codon frequencies of horizontally transferred DNA may be very different to that of the host .
Results of comprehensive comparisons in coding sequence modeling on multiple data sets show that the CPPLS approach provides superior performance compared to the IMM. Furthermore, codon representations were found to be superior in classifying ORFs compared to DNA and protein representations for the CPPLS method. We therefore conclude that a multivariate approach like CPPLS should be more utilized in coding sequence modeling, as well as in pattern recognition problems where sequences are to be classified by their content, like for instance, in the detection of horizontally transferred DNA.
Tahir Mehmood’s scholarship has been fully financed by the Higher Education Commission of Pakistan.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.