- Research article
- Open Access
What can we learn from noncoding regions of similarity between genomes?
© Down and Hubbard; licensee BioMed Central Ltd. 2004
- Received: 04 December 2003
- Accepted: 15 September 2004
- Published: 15 September 2004
In addition to known protein-coding genes, large amounts of apparently non-coding sequence are conserved between the human and mouse genomes. It seems reasonable to assume that these conserved regions are more likely to contain functional elements than less-conserved portions of the genome.
Here we used a motif-oriented machine learning method based on the Relevance Vector Machine algorithm to extract the strongest signal from a set of non-coding conserved sequences.
We successfully fitted models to reflect the non-coding sequences, and showed that the results were quite consistent for repeated training runs. Using the learned models to scan genomic sequence, we found that they often made predictions close to the start of annotated genes. We compared this method with other published promoter-prediction systems, and showed that the set of promoters which are detected by this method is substantially similar to that detected by existing methods.
The results presented here indicate that the promoter signal is the strongest single motif-based signal in the non-coding functional fraction of the genome. They also lend support to the belief that there exists a substantial subset of promoter regions which share several common features including, but not restricted to, a relative abundance of CpG dinucleotides. This subset is detectable by a variety of distinct computational methods.
- Transcription Start Site
- Mouse Genome
- Annotate Gene
- Relevance Vector Machine
- Distribute Annotation System
Since the publication of draft sequences for the human  and mouse  genomes, several groups have run large-scale comparisons of the sequences to detect regions of conserved sequence. An initial survey of these was published along with the draft mouse genome , with additional comparisons appearing since then . Briefly, protein coding genes are – as we might expect – among the most strongly conserved regions, but homologous sequences can be found throughout the genome. In total, it is possible to align up to 40% of the mouse genome to human sequence , but it seems likely that at least some of this is just random "comparative noise" – regions of sequence which serve no particular purpose but which, purely by chance, have not yet accumulated enough mutations to make their evolutionary relationship unrecognisable. However, it is widely accepted that some of the noncoding-but-similar regions, especially those with the highest levels of sequence identity between the two species, are preferentially conserved because they perform some important function. It has been estimated that around 5% of the genome is under purifying selection , indicating that mutations in these regions have deleterious effects: a strong suggestion of some important function.
Here, we apply the Eponine Windowed Sequence (EWS) sequence analysis method method which uses a Relevance Vector Machine (RVM)  to extract a minimal set of short motifs which are able to discriminate between two sets of sequences: in this case, a positive set of conserved non-coding sequences and a negative set of randomly picked non-coding sequences. The EWS model is an adaption of the Eponine Anchored Sequence (EAS) model, first applied for transcription start site prediction in  and subsequently used to predict a range of additional biological features including translation start sites and transcription termination sites [A. Ramadass, unpublished] While EAS is designed to classify individual points in a sequence – a feature which allows the model to predict precise locations for features such as transcription start sites – EWS classifies complete blocks (windows) of sequence. The basis functions (inputs) of the RVM are sums of position-weight matrix scores  across the whole window.
We considered a set of alignments made by the blastz program  between release NCBI33 of the human genome and release NCBIM30 of the mouse genome. Since unprocessed blastz aligns around 40% of human sequence to the mouse genome, we chose to focus on the 'tight' alignments. These are a subset of alignments which are rescored and thresholded using a set of parameters given in , and cover only around 5.6% of the human genome – a proportion much closer to the fraction of bases thought to be under purifying selection .
Since we were interested in non-coding features of the genome, we ignored all regions where an alignment overlaps an annotated gene structure. This removed 20.8% of aligned bases. It is possible that some genes, and especially pseudogenes, have been missed by the annotation process, so we also removed portions covered by ab initio gene predictions from the Genscan program . This eliminated an additional 4.3% of aligned bases. Finally, repetitive sequence elements annotated by the programs RepeatMasker  and trf  (5.9%) were removed from the working set. The remainder of the aligned regions were split into non-overlapping 200 base windows, ignoring any portions less than 200 bases. This gave a set of 13925 sequences which are well-conserved between human and mouse – and therefore likely to be functional – but which are very unlikely to be part of the protein-coding repertoire. These formed the positive training set for our machine learning strategy.
A negative training set of equal size was prepared by picking 200-base windows at random from the non-coding, non-repetitive portions of chromosome 6, using the same criteria to define repeats and coding sequence. While it is probable that this set also included some functional sequences, we would expect them to be represented at a substantially lower level than in the conserved set.
Motifs used in EWS homology model 1. The entries in this table show consensus sequences of the weight matrices used in the model (note that it is possible for two distinct weight matrices to have the same consensus sequence). Motifs are listed in both forwards and reverse-complement orientation, and the two sections of the table indicate whether that motif is given a positive or negative weight in the learned linear model.
We have shown here that, when presented with a set of non-coding sequences which are strongly conserved between human and mouse, a simple motif-oriented machine learning system consistently builds models which are able to detect a substantial fraction of human promoter regions with good accuracy. This strongly suggests that this promoter signal represents the most widely used motif-based signal in functional non-coding sequence. While the model learned here can clearly be applied for the purpose of genome-wide promoter annotation, in practise existing methods offer better coverage and (in the case of the EponineTSS predictor) predictions for the precise location of the transcription start site.
It is interesting that the promoter model learned by this technique detected substantially the same set of promoters as found by the EponineTSS and PromoterInspector methods. It has previously been remarked that these two methods detect similar sets , but this could perhaps be explained by the fact that both methods were initially derived from similar sets of known promoter sequences (in both cases, training data was extracted from the EPD database . In the case of the homology models described here, there is no connection with EPD, or any similar set of known promoters: the training data was picked purely on the basis of its high similarity to corresponding portions of the mouse genome. These results therefore support the alternate view that there is a particular 'easily detected' subclass of promoter sequences.
One distinct group of promoters, which previous results show may correspond to this easily detected family, is the set of promoters associated with CpG islands . However, while a number of the motifs listed in table 1 are G/C rich and/or contain the CpG dinucleotide, by no means all of the motifs match this description, and indeed one motif containing CpG has a negative weight in the linear model – its presence in a sequence will reduce the model's output score – while some A/T rich motifs have positive weights. We therefore believe that the signals detected here are significantly more complex than a simple over-representation of CpG dinucleotides. Experiments with smaller seed-word sizes support this assumption: while dinucleotide-based models were also able to predict promoter regions, the accuracy was lower than for models including longer motifs. Finally, we show that while the predictive capacity of dinucleotide models is largely eliminated once CpG dinucleotides are removed from the sequence, models including longer words are still able to make correct promoter predictions in many cases. So while CpG dinucleotides are an important contribution to the promoter signal, they are clearly not the only component.
Genomic sequence and annotation
Human genome sequence release NCBI33 and mouse genome release NCBIM30 were extracted from Ensembl databases , which also contained gene predictions from Genscan  and repeat data from RepeatMasker  and trf . Curated annotation of gene structures on human chromosome 6 was obtained from the Vega database . Vega and Ensembl data was extracted directly from the SQL databases using the BioJava toolkit with biojava-ensembl extensions .
Human-mouse genome alignments were generated by the blastz alignment program. These were subsequently re-scored and filtered to give a 'tight' set of high-confidence alignments, as described in . We downloaded the tight alignment set from the UCSC genome website .
Pseudochromosome for testing promoter-finding methods
A 16.3 Mb pseudochromosome sequence was produced based on version 2.3 of the curated annotation for human chromosome 22. This includes all the experimentally-validated gene structures and their upstream regions, while omitting regions containing genes that are predicted but not fully verified. In the case of a pair of divergent genes where one has been verified and the second has not, their shared upstream region was cut at the midpoint. More information about pseudochromosome construction is given in .
Eponine Windowed Sequence learning
The Eponine Windowed Sequence (EWS) model is designed by analogy to the Eponine Anchored Sequence model first described in , but rather than targeting individual points in the sequence, it is designed to classify small regions or windows of a sequence, based purely on their own sequence content.
The EWS model uses the Relevance Vector Machine  algorithm to drive the training process. Relevance Vector Machines solve classification and regression problems by building Generalised Linear Models (GLMs) as weighted sums of a "working set" of basis functions. During the training process, those basis functions which are not informative are given weights close to zero and eventually discarded from the working set. To explore very large sets of possible basis functions, it is possible to add extra basis functions during the course of the training process .
The "sensors" of the EWS model are DNA position-weight matrices , which make convenient models of short sequence motifs. When using weight matrices to analyse sequence windows, we sum the weight matrix probability scores for all possible positions within the sequence. Normalising for the length of the sequence being inspected and the size of the PWM, the basis functions of the model take the form:
where W(s) is the probability that sequence s was emitted by weight matrix W, |S| is the sequence length, |W| is the weight matrix length, and denotes a subsequence from i to j.
An initial set of basis functions is proposed by taking all possible DNA motifs of a specified length (typically 5) and generating weight matrices which preferentially recognise these motifs. As the relevance vector machine trainer removes non-informative basis functions from the working set, they are replaced by applying one of the following sampling strategies to a basis function picked randomly from the working set:
Generate a new weight matrix in which each column is a sample from a Dirichlet distribution with its mode equal to the weights in the corresponding column of the parent weight matrix.
Generate a new weight matrix one column shorter than the parent by removing either the first of the last column.
By using these sampling rules, the trainer is able to explore motif space. The process of generating candidate motifs using these rules then selecting the most informative using the RVM can be seen as a form of genetic algorithm.
Chromosome 22 annotation data version 2.3 were produced by the Chromosome 22 Annotation Group at the Sanger Institute and were obtained from the World Wide Web at http://www.sanger.ac.uk/HGP/Chr22 (Dunham et al. unpublished data). TD would like to thank the Wellcome Trust for funding.
- The Genome International Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature 2001, 409: 860–921. 10.1038/35057062View ArticleGoogle Scholar
- The Mouse Genome Sequencing Consortium: Initial sequencing and comparative analysis of the mouse genome. Nature 2002, 420: 520–562. 10.1038/nature01262View ArticleGoogle Scholar
- Xuan Z, Wang J, Zhang M: Computational comparison of two mouse draft genomes and the human golden path. Genome Biology 2002, 4: R1. 10.1186/gb-2002-4-1-r1PubMed CentralView ArticlePubMedGoogle Scholar
- Schwartz S, Kent W, Smit A, Zhang Z, Baertsch R, Hardison R, Haussler D, Miller W: Human-Mouse Alignments with BLASTZ. Genome Res. 2003, 13: 103–107. 10.1101/gr.809403PubMed CentralView ArticlePubMedGoogle Scholar
- Tipping M: Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 2000, 1: 211–244. 10.1162/15324430152748236Google Scholar
- Down T, Hubbard T: Computational Detection and Location of Transcription Start Sites in Mammalian Genomic DNA. Genome Res. 2002, 12: 652–658. 10.1101/gr.216102View ArticleGoogle Scholar
- Bucher P: Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. Journal of Molecular Biology 1990, 212: 563–578.View ArticlePubMedGoogle Scholar
- Mungall A, Palmer S, Sims S, Edwards C, Ashurst J, Wilming L, Jones M, Horton R, Hunt S, Scott C, Gilbert J, Clamp M, Bethel G, Milne S, Ainscough R, Almeida J, Ambrose K, Andrews T, Ashwell R, Babbage A, Bagguley C, Bailey J, Banerjee R, Barker D, Barlow K, Bates K, Beare D, Beasley H, Beasley O, Bird C, Blakey S, Bray-Allen S, Brook J, Brown A, Brown J, Burford D, Burrill W, Burton J, Carder C, Carter N, Chapman J, Clark S, Clark G, Glee C, Clegg S, Cobley V, Collier R, Collins J, Colman L, Corby N, Coville G, Culley K, Dhami P, Davies J, Dunn M, Earthrowl M, Ellington A, Evans K, Faulkner L, Francis M, Frankish A, Frankland J, French L, Garner P, Garnett J, Ghori M, Gilby L, Gillson C, Glithero R, Grafham D, Grant M, Gribble S, Griffiths C, Griffiths M, Hall R, Halls K, Hammond S, Harley J, Hart E, Heath P, Heathcott R, Holmes S, Howden P, Howe K, Howell G, Huckle E, Humphray S, Humphries M, Hunt A, Johnson C, Joy A, Kay M, Keenan S, Kimberley A, King A, Laird G, Langford C, Lawlor S, Leongamornlert D, Leversha M, Lloyd C, Lloyd D, Loveland J, Lovell J, Martin S, Mashreghi-Mohammadi M, Maslen G, Matthews L, McCann O, McLaren S, McLay K, McMurray A, Moore M, Mullikin J, Niblett D, Nickerson T, Novik K, Oliver K, Overton-Larty E, Parker A, Patel R, Pearce A, Peck A, Phillimore B, Phillips S, Plumb R, Porter K, Ramsey Y, Ranby S, Rice C, Ross M, Searle S, Sehra H, Sheridan E, Skuce C, Smith S, Smith M, Spraggon L, Squares S, Steward C, Sycamore N, Tamlyn-Hall G, Tester J, Theaker A, Thomas D, Thorpe A, Tracey A, Tromans A, Tubby B, Wall M, Wallis J, West A, White S, Whitehead S, Whittaker H, Wild A, Willey D, Wilmer T, JM W, Wray P, Wyatt J, Young L, Younger R, Bentley D, Coulson A, Durbin R, Hubbard T, Sulston J, Dunham I, J R, Beck S: The DNA sequence and analysis of human chromosome 6. Nature 2003, 425: 805–811. 10.1038/nature02055View ArticlePubMedGoogle Scholar
- Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology 1997, 268: 78–94. 10.1006/jmbi.1997.0951View ArticlePubMedGoogle Scholar
- Smit A, Green P: RepeatMasker.[http://ftp.genome.washington.edu/RM/RepeatMasker.html]
- Benson G: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999, 27: 573–580. 10.1093/nar/27.2.573PubMed CentralView ArticlePubMedGoogle Scholar
- Hubbard T, Barker D, Birney E, Cameron G, Chen Y, L C, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Huminiecki L, Kasprzyk A, Lehvaslaiho H, Lijnzaad P, Melsopp C, Mongin E, Pettett R, M P, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A, I V, Clamp M: The Ensembl genome database project. Nucleic Acids Res. 2002, 30: 30–31. 10.1093/nar/30.1.38View ArticleGoogle Scholar
- Distributed Annotation System[http://www.biodas.org/]
- Collins J, Goward M, Cole C, Smink L, Huckle E, Knowles S, Bye J, Beare D, Dunham I: Reevaluating Human Gene Annotation: A Second-Generation Analysis of Chromosome 22. Genome Res. 2003, 13: 27–36. 10.1101/gr.695703PubMed CentralView ArticlePubMedGoogle Scholar
- Scherf M, Klingenhoff A, Freeh K, Quandt K, Schneider R, Grote K, Frisch M, Gailus-Durner V, Seidel A, Brack-Werner R, Werner T: First pass annotation of promoters on human chromosome 22. Genome Res. 2001, 11: 333–340. 10.1101/gr.154601PubMed CentralView ArticlePubMedGoogle Scholar
- Cross S, Bird A: The Ensembl genome database project. Nucleic Acids Res. 2002, 30: 30–31. 10.1093/nar/30.7.e30View ArticleGoogle Scholar
- Perier R, Praz V, Junier T, Bonnard C, Bucher P: The Eukaryotic Promoter Database (EPD). Nucleic Acids Res. 2000, 28: 307–309. 10.1093/nar/28.1.302View ArticleGoogle Scholar
- Vega Genome Browser[http://vega.sanger.ac.uk/]
- UCSC Genome Bioinformatics[http://genome.cse.ucsc.edu/]
This article is published under license to BioMed Central Ltd. This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.