Gene prediction using the Self-Organizing Map: automatic generation of multiple gene models
© Mahony et al 2004
Received: 26 November 2003
Accepted: 05 March 2004
Published: 05 March 2004
Skip to main content
© Mahony et al 2004
Received: 26 November 2003
Accepted: 05 March 2004
Published: 05 March 2004
Many current gene prediction methods use only one model to represent protein-coding regions in a genome, and so are less likely to predict the location of genes that have an atypical sequence composition. It is likely that future improvements in gene finding will involve the development of methods that can adequately deal with intra-genomic compositional variation.
This work explores a new approach to gene-prediction, based on the Self-Organizing Map, which has the ability to automatically identify multiple gene models within a genome. The current implementation, named RescueNet, uses relative synonymous codon usage as the indicator of protein-coding potential.
While its raw accuracy rate can be less than other methods, RescueNet consistently identifies some genes that other methods do not, and should therefore be of interest to gene-prediction software developers and genome annotation teams alike. RescueNet is recommended for use in conjunction with, or as a complement to, other gene prediction methods.
Computational gene prediction methods have yet to achieve perfect accuracy, even in the relatively simple prokaryotic genomes. Problems in gene prediction centre on the fact that many protein families remain uncharacterised. As a result, it seems that only approximately half of an organism's genes can be confidently predicted on the basis of homology to other known genes [1–3], so ab initio prediction methods are usually employed to identify many protein-coding regions of DNA.
Currently, the most popular prokaryotic gene-prediction methods, such as GeneMark.hmm  and Glimmer2 , are based on probabilistic Markov models that aim to predict each base of a DNA sequence using a number of preceding bases in the sequence. These methods are undoubtedly very successful, with published sensitivity rates between 90% and 99% for most prokaryotic genomes. However, as the sensitivity rates of the methods rise, specificity generally tends to fall, and while the application of sophisticated post-processing rules can correct many false-positive predictions, no method has yet achieved 100% accuracy. This is especially the case in the more complex eukaryotic gene-finding problem, where less than 80% of exons in anonymous genomic sequences are correctly predicted by current methods [2, 6–8].
For the foreseeable future it does not seem that the exact set of genes in any organism can be automatically predicted by any single computational method. In practice, this has meant that the best predictions are to be found by combining evidence from two or more independent methods [3, 9]. Genome annotation teams often compare the evidence offered by multiple gene-finders in order to predict the gene complement of a given genome. Because of the degree of 'manual' annotation that now takes place in the major genome sequencing centres, a gene-prediction tool will be of practical use if it can exclusively predict genes that other gene-finders cannot.
Many ab initio gene-prediction methods are based on single models of protein-coding regions and therefore make the implicit assumption that all protein-coding regions within a particular genome will share similar statistical properties. However, evidence has mounted that single gene models of intrinsic coding measures are no longer fully satisfying [10, 11]. The problem with single model methods centres on the degree of oligonucleotide composition variation that exists within most genomes. On the codon level, intra-genomic variation in codon bias has long been correlated with expression level . Counterbalancing the translational selection theory of codon bias is the effect of mutational bias [13, 14]. Many other, often more subtle levels of variation have been recognised over the years, with many disparate evolutionary pressures shown to be acting on codon usage bias . For example, strand-specific codon usage biases have often been recognised [16–20], leading to more general studies of correlation between the location of the gene on the genome and codon bias , and the more specific discovery of a A+T skewed bias near the replication terminus of bacterial genomes . Other effects shown to shape codon usage are gene length  and selection at the amino acid level . It has also been suggested that content variation can occur at the exon level in eukaryotic genes, the possibility existing that some exons in a gene may have different codon usage patterns to others . Given that some of the above pressures on codon usage have only recently been discovered, it is likely that some more subtle patterns have yet to be recognised, and therefore it is difficult to predict the level of compositional variation that will be present in an anonymous genomic sequence.
The need for gene-finding methods that can overcome the problems presented by intra-genomic variation was recognised and addressed in the case of prokaryotic genomes by GeneMark-Genesis , which derives two models for each genome according to typical and atypical codon usage clusters in that genome. This increase in the number of gene models led to an increase in accuracy of the GeneMark method. While Hayes & Borodovsky experimented with a third ('highly-typical') codon usage cluster and an associated model in some cases, they did not see the need to further sub-cluster the atypical codon usage set in order to make even more models. Overly sub-clustering the training data would not be useful in the case of Markov-based methods, as the data contained in each sub-cluster may not allow for a good estimation of model parameters. However, generating more specific models for subtle patterns found in the training set can only be advantageous if it can be done in a way that minimises loss of overall accuracy and produces no extra false-positive predictions.
This paper aims to show how the Self-Organizing Map neural network algorithm can be used to automatically identify the major trends in oligonucleotide variation in a genome, and in doing so provide multiple gene models for use in gene prediction. It will be explained that this approach is an effective solution to the problem of intra-genomic variation. Specific examples of genes predicted only by this method are offered, thus demonstrating the usefulness of the approach in genome annotation.
A further advantage of using the Self-Organizing Map for gene prediction is the ability of the algorithm to use complex descriptors as measures of gene coding potential. We demonstrate this ability using relative synonymous codon usage (RSCU) as our measure of gene coding potential. Unlike other gene coding measures, RSCU is not based on the absolute frequency of k-mers, but instead describes the codon choice for each amino acid. Markov chains based on the RSCU measure would have transition probabilities that are conditioned on the underlying amino-acids. Although theoretically possible , the practical computation of such Markov chains would give rise to major difficulties. Therefore, the ability of our approach to make use of a sophisticated gene coding descriptor such as RSCU is a distinct advantage of our approach over Markov model based methods.
In this study, relative synonymous codon usage (RSCU) vectors are used as the measure of protein-coding potential for a given window of sequence. The RSCU value for a codon 'i' is defined as:
where Obsi is the observed number of occurrences of codon 'i', and Expi is the expected number of occurrences of the same codon (based on the number of times the relevant amino acid is present in the gene and the number of synonymous alternatives to 'i', assuming a uniform choice of synonymous codons). In order to make the data more compatible with the mathematical methods used, the log of each RSCUi value is found so that the resulting value is positive if the codon is used more than expected in that gene, and negative if the codon is used less than expected. Values were capped at ± 10, and set to 0 in the case of the non-occurrence of an amino acid in the sample. Taking the RSCU values for each of the codons with synonymous alternatives (and ignoring the 3 stop codons and the Trp and Met codons), each sample can be represented by a vector of 59 values.
A vector (X i ), corresponding to a gene's RSCU values, is loaded from the training dataset.
The lattice node is found whose model vector most closely resembles the input pattern. This node is denoted the 'winning node'.
The winning node's model, W (as well as a certain number of 'neighbourhood bubble' node models) is changed to be more similar to the input vector by the equation:
If all the vectors in the training dataset are processed, we say that an epoch has been completed. In this study, all SOMs are trained for 3000 epochs.
The 'neighbourhood bubble' mentioned in step 3 is a group of nodes centered at the winning node. The radius of this bubble is initialised to be large and is linearly decreased during training until only the winning node's model is changed. Changing the models on the winning node's neighbours allows the clustering of similar patterns. The learning rate (η) in step 3 is initialised close to 1 and is also linearly decreased during training until it is held constant at a predefined fraction. The linear decrease in learning rate means that each node's model will not get changed as much or as often as training progresses. Two recognised phases of training result; an ordering phase where the lattice takes its general shape, and a convergence phase where the nodes get more specialised to respond to specific patterns.
In this work, similarity between two vectors is measured by finding the cosine of the angle between them. A cosine of 1 represents exactly similar vectors while a cosine of 0 represents exactly dissimilar vectors.
The SOM is used mainly in data visualisation, as it can be effectively used to reduce high-dimensional data to a two dimensional map. One of the main strengths of the method is the ability to automatically cluster similar patterns in its training set. In the context of codon usage data, the SOM has been previously used to cluster genes on the basis of similar codon usage [28–30]. However, the previous studies have concentrated on identifying genes with atypical codon usage and hypothesising their origin as horizontally transferred genes. It has since been shown that atypical codon usage is not sufficient evidence to show that a gene has origins in horizontal transfer events [22, 31]. In contrast, this study uses the fact that once a SOM has been trained using codon usage information, the nodes of the SOM encapsulate models that are representative of the major codon usage patterns within the training set.
If a new sequence is inputted to a trained SOM, we can easily be told which node's model is most similar to this new sequence, and most importantly, how similar. The similarity (cosine) score is then converted to the probability that the sequence is protein-coding. This is achieved by finding the mean cosine score received by a set of random length, random sequence genes that are generated using the same nucleotide bias as the mutational bias found in the genome. Using the mean score, each similarity score can be converted to a z-score, which is in effect the probability that the sequence is not a random sequence.
Separate SOMs are trained for each of the 15 genomes under test. The SOMs are each 15×15 nodes in size and trained for 3000 cycles. Finding genes via homology search is usually the first step to be carried out in a genome annotation process, so our training sets consist of all genes in the relevant organism that were previously confirmed by homology searches and are also at least 750 bp long. Note that unlike other gene-finding methods, no statistical knowledge of non-coding DNA is necessary as part of the SOM's training.
In analysing an entire genome sequence, a sliding window is used to split each of the six reading frames into small samples. The default window size is 110 triplets, which has been chosen as a balance between having a window size long enough to evaluate a meaningful RSCU vector, and short enough to predict short genes. Each window is offset from the next by 10 triplets. An arbitrary probability score of 0.1 is used as the threshold for deciding if a sequence was protein-coding, and all samples that scored higher than 0.1 are recorded as predictions. If a stop codon lies in the sample, the gene prediction is annotated as having ended at that point. Note, however, that no effort is made to find stop codons if they are not within the prediction, and no effort whatsoever is made to find any start codons in the prediction.
Once all the samples are processed, some simple post-processing is carried out. Naturally, all same-frame concurrent predictions are merged. Predictions that are totally overlapped by another prediction are deleted if they are less than 75% the length of the other. Similarly, any prediction in which more than half its length is overlapped is deleted if it is less than half the length of the other prediction. Alternatively, any prediction that is less than 90% as long as the overlapping prediction and receiving a lower score is deleted. Finally, any prediction that is overlapped on both ends to a total overlap of at least 70% is also deleted. A prediction size of 75 codons was found by trial and error to be the smallest gene-coding region that could accurately be found using RescueNet.
While the above rules aim to delete smaller erroneous predictions, it is recognised that the loose nature of the rules leave room for many other overlapping predictions. However, it was found that in many overlapping cases it was difficult to decide which prediction to delete. Therefore, the best solution is to leave both predictions rather than misleading an annotator by giving only one, possibly erroneous, prediction.
In assessing the accuracy of our method, we had to take into account that our method will not predict most start sites, and some stop sites, exactly. We assume that our method will be of most use to annotation teams who rigorously inspect the results of our method in conjunction with the results of other gene prediction programs. Such annotation teams base their final genome annotation on widespread evidence, so the fact that our method may produce inexact start and stop sites will not be a major disadvantage. Therefore, a correct prediction is defined here as one that predicts more than 50% of an annotated gene in the correct frame. This criterion means that only predictions that are useful to annotators are considered to be correct.
Previous studies discuss the possibility that the GenBank annotation of various genomes may be incomplete or incorrect in some cases [5, 32]. Since many GenBank annotations are not experimentally corroborated, this possibility remains strong. Large-scale benchmarking of gene-prediction algorithms is therefore difficult, because few 'gold standard' annotations exist for prokaryotic genomes. Also, in most cases hypothetical gene annotations in the public databases have their roots in the predictions of an ab initio method, thus biasing any comparison of accuracy in favour of the particular method used in the annotation of that genome. However, for the purpose of defining accuracy in this study we must assume that all GenBank annotations are correct and complete. Sensitivity (Sn) is defined here as the percentage of GenBank gene records that are predicted correctly by our method. Specificity (Sp) is defined as the percentage of total RescueNet gene predictions that are correct.
Accuracy of RescueNet in 15 bacterial genomes.
Number of Genes Annotated
Training Set Size
Sn. >225 bp (%)
Sn. Conserved (%)
Sequence data used in this study include the following 15 genomes and associated published genes available from the GenBank database: A. aeolicus , B. subtilis , Buchnera sp. , B. burgdorferi , C. jejuni , D. radiodurans (chromosome 1) , E. coli , H. influenzae , H. pylori , M. genitalium , M. jannaschii , R. solanacearum , S. coelicolor , Synechocystis sp. , and Y. pestis . These genomes were chosen to be representative of a wide range of GC content.
Three of the genomes tested have very high G+C content (D. radiodurans, R. solanacearum and S. coelicolor). High G+C content genomes present a problem to many gene-finding methods because of the relative infrequency of randomly occurring stop codons. The scarcity of stop codons has the effect of a large number of long, overlapping ORFs occurring in the sequence, relatively few of which are actually protein-coding. Many of the current gene-finders fail to discriminate accurately between coding and non-coding ORFs in this type of situation.
Another interesting pointer to the advantages of RescueNet in high G+C content genomes is the substantially higher percentage of genes with database matches that are correctly predicted by RescueNet (see Table 1). In D. radiodurans, for example, 92.54% of genes with database matches are correctly predicted by RescueNet compared with only 84.28% of the total GenBank gene annotations. These figures suggest that hypothetical genes that are predicted only by Markov-based methods are poorly recognised by RescueNet, possibly because many hypothetical genes in high G+C content genomes may in fact be false gene predictions.
The general location of frameshifts within a gene sequence can be found by our method. Two features of our approach facilitate this. Firstly, even though the overall codon usage of a frameshifted gene could seem unusual, the two coding sections of the gene should each retain the organism's native codon usage. Secondly, our approach does not require that a prediction be bounded by a start and a stop codon. The sliding window used in our algorithm can therefore predict the correct coding frames each side of the frameshift.
In an interesting example in Figure 1, two RescueNet predictions overlap the D. radiodurans gene DR1143 in such a way that it seems that there may be a frameshift that extends the protein-coding region of the gene past the annotated stop codon. In fact, combining the two RescueNet predictions offers a better database match to the same genes that the original annotation matches. This increases the possibility that the actual gene contains an authentic frameshift or at least that the extra RescueNet prediction is an evolutionary artefact.
There may be a perception that any method using codon usage as the coding measure will only give predictions that are a subset of the predictions given by a Markov-based method that uses a 4th or 5th order model. To counter this argument, we compared the predictions of our method in two genomes (H. influenzae and H. pylori) to those of the web-based version of GeneMark.hmm 2.1 for prokaryotes http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi, which generated results using two models; the 'typical' and 'atypical' models.
The published sensitivity of GeneMark.hmm in the H. influenzae genome (96.2%, see ) is higher than that of our method, and the published specificity (89.8%) is lower, so GeneMark.hmm should give more predictions overall for this genome. However, 11 H. influenzae genes are predicted correctly by our method which are not predicted by GeneMark.hmm using 5th order models, and 14 genes are predicted correctly by our method which are not predicted by GeneMark.hmm using 4th order models. In the H. pylori genome, GeneMark.hmm has again a higher sensitivity and a lower specificity (94.0% & 91.3% respectively), but even more genes are exclusively predicted by our method; 25 genes as compared to the 5th order GeneMark.hmm models and 30 genes as compared to the 4th order models. Although these genes represent a small proportion of the total number of genes in the respective organisms, the fact that they are only predicted by RescueNet gives some indication of the advantage of using RescueNet in conjunction with other gene prediction methods.
RSCU is only one of many possible criterion with which to measure coding potential (see  for a review of others). In-phase hexamers are accepted as the most accurate k-mer frequency based measure of coding potential, and so their use as the coding measure in a Self-Organizing Map may offer improvement in accuracy over RescueNet. However, the larger space dimension of the hexamer coding measure may force a larger sliding window to be used and therefore the use of hexamers could actually decrease the precision of gene prediction.
The future use of alternative coding measures with our approach may also help to overcome difficulties in recognising genes that are reputed to be horizontally transferred in origin. Horizontally transferred genes would be more likely to have dissimilar codon usage patterns to other genes in the genome. Since our approach currently relies on the codon usage patterns it finds in the training set, it is unlikely to mark areas of unseen codon usage as protein-coding regions. Note, however, we are not suggesting all genes that were not recognised by our approach are of horizontally transferred origin. There are many explanations for a gene displaying atypical codon usage, and codon usage cannot be used as an accurate indicator of horizontal transfer.
There may be other ways to improve the accuracy of our method. The current implementation has a rather simple post-processing step that does not rely on modifying the prediction in order to include start or stop codons. While the practise of not constraining a prediction to be bound by a start and stop codon stands in stark contrast to other methods, we did not wish to lengthen or shorten any predictions artificially, since doing so can mislead annotation teams (especially in start site annotation). Relatively simple post-processing steps may, in fact, be advantageous. Our predictions represent a raw account of regions of the genome that display typical or native codon usage patterns, and this in itself may be of interest to annotation teams who use codon usage plots as the basis for some genomic feature annotations.
Gene-finding in prokaryotic genomes is still not a completely solved problem, partly because current methods use a limited number of models to represent the training data. In this paper, we have introduced an alternative, independent approach to the problem. The Self-Organizing Map approach has the potential to overcome the issue of variation in the statistical properties of the training set data, and can automatically train a representative number of gene-models, depending on the degree of variation within the training data.
While the current implementation of our approach produces lower raw sensitivity scores in comparison to established Markov-based techniques, we have clearly shown that our method can predict some genes that other methods cannot. We have also demonstrated advantages in annotating the traditionally 'difficult' high G+C content genomes. Annotation teams who are concerned with the complete and accurate annotation of a sequenced genome should find our method useful when used alongside other gene-finding methods. The relatively high specificity of our method, coupled with the independent nature of the algorithm, should make it a useful tool in confirming the predictions of other software programs and in some cases pointing out areas of conflicting or contradictory predictions that are worthy of further examination.
Project name: RescueNet
Project home page: http://bioinf.nuigalway.ie/RescueNet/
Operating systems: Windows, Linux, IRIX, Digital Unix. Source code also available.
Programming Language: C++
Licence: GNU GPL
Restrictions to use by non-academics: Please contact the authors.
The authors would like to thank Kim Rutherford of the Welcome Trust Sanger Institute Pathogen Sequencing Group for his help and suggestions. We also thank the two anonymous reviewers for their useful comments. S.M. is funded by an EMBARK postgraduate fellowship from the Irish Research Council for Science, Engineering & Technology.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.