- Open Access
DoOPSearch: a web-based tool for finding and analysing common conserved motifs in the promoter regions of different chordate and plant genes
BMC Bioinformatics volume 10, Article number: S6 (2009)
The comparative genomic analysis of a large number of orthologous promoter regions of the chordate and plant genes from the DoOP databases shows thousands of conserved motifs. Most of these motifs differ from any known transcription factor binding site (TFBS). To identify common conserved motifs, we need a specific tool to be able to search amongst them. Since conserved motifs from the DoOP databases are linked to genes, the result of such a search can give a list of genes that are potentially regulated by the same transcription factor(s).
We have developed a new tool called DoOPSearch http://doopsearch.abc.hu for the analysis of the conserved motifs in the promoter regions of chordate or plant genes. We used the orthologous promoters of the DoOP database to extract thousands of conserved motifs from different taxonomic groups. The advantage of this approach is that different sets of conserved motifs might be found depending on how broad the taxonomic coverage of the underlying orthologous promoter sequence collection is (consider e.g. primates vs. mammals or Brassicaceae vs. Viridiplantae). The DoOPSearch tool allows the users to search these motif collections or the promoter regions of DoOP with user supplied query sequences or any of the conserved motifs from the DoOP database. To find overrepresented gene ontologies, the gene lists obtained can be analysed further using a modified version of the GeneMerge program.
We present here a comparative genomics based promoter analysis tool. Our system is based on a unique collection of conserved promoter motifs characteristic of different taxonomic groups. We offer both a command line and a web-based tool for searching in these motif collections using user specified queries. These can be either short promoter sequences or consensus sequences of known transcription factor binding sites. The GeneMerge analysis of the search results allows the user to identify statistically overrepresented Gene Ontology terms that might provide a clue on the function of the motifs and genes.
The traditional bioinformatics approach for transcription factor binding site (TFBS) discovery uses collections of known (experimentally verified) TFBS sequences to find their occurrences in a promoter. These sequences are usually described by a consensus sequence or a position specific matrix (PSM). There are several different databases, whose main goal is to collect these consensus sequences and matrices. Two of the better known collections are the TRANSFAC  and JASPAR  databases. They contain information for a large number of different TFBS groups. Although the consensus sequence and matrix-based approach has been used for almost twenty years , it still has fundamental problems, such as having either too many false positives or false negatives.
Another approach is to use sophisticated computer algorithms designed for the ab initio prediction of overrepresented sequence motifs in a collection of promoter sequences [4, 5]. In this case the user provides a number of promoter regions, thought to be co-regulated or co-expressed, and the computer algorithm identifies statistically overrepresented sequence oligos, putative TFBSs. As the TFBSs are usually short, the promoter regions long, and the bases can vary in certain positions, there are similar problems, as in the method mentioned previously.
Besides the traditional analysis methods, the motif discovery algorithms can also be used to find possible TFBSs in the promoter regions of homologous gene sets. This method is called phylogenetic footprinting , and an important prerequisite of it is to collect as many orthologous promoter sequences as possible . These sequences are available in public DNA databases or in broadly known genome browsers like ENSEMBL  or UCSC . The most simple way to collect them however, is to use an orthologous promoter database, such as OMGProm  for the promoters of mammals, or DoOP  for the promoters of chordates and plants. Recent developments in comparative genomics allow us to search for conserved motifs in the full genome of a large number of species. The first collections of conserved sequence motifs are now available for downloading and/or browsing. Xie et al.  analysed the human, mouse, rat and dog whole genome alignments, and described 174 sequence motifs that were highly conserved. These motifs are now part of the JASPAR database , thereby they can be used in a promoter analysis. The cisRED  database and the CORG  framework are based on the human genome annotation from ENSEMBL and on the COMPARA database respectively, complemented by other sources. The extracted homologous gene sets in the cisRED database were analysed using different motif discovery programs in order to find conserved sequence regions. The DoOP database [11, 15] contains more than one million different conserved sequence motifs from the promoters of chordate and plant homologous gene sets. These motifs are derived from the conserved regions of DIALIGN  alignments of the orthologous promoters and thus represent a collinear set of possible TFBSs.
There are a growing number of methods and websites that offer promoter analysis tools and use different combinations of the abovementioned approaches. In general, they are more or less relying on the extra information coming from comparative genomics methods, and orthologous promoter collections [17–19].
Our aim was to develop a collection of web-based tools to search with known or de novo discovered TFBSs, as well as with longer promoter sequences, in order to collect and examine in detail genes with similar conserved motifs in their promoter regions. In this paper we describe the DoOPSearch tool that provides a new and unique method to search the motif collections of the DoOP database with a new program called MOFEXT. The server also provides a simple pattern search, based on FUZZNUC, in the promoter regions of chordate and plant genes. To analyse the Gene Ontology terms associated with a given gene in the results, we integrated the GeneMerge program into DoOPSearch. In addition, a Perl API called Bio::DOOP was also developed for the easy manipulation and querying of the DoOP database content.
Determining conserved motifs from orthologous promoters of different taxonomic groups
In the original DoOP database , we used all the available orthologous promoter sequences of each gene to generate the multiple alignments and to determine the evolutionary conserved motifs. Now we are using four and ten different taxonomic categories in the chordate and plant collections when selecting sequences for the multiple alignments (Figure 1). The narrowest category amongst the plants is the Brassicaceae family (Table 1), which includes the Brassica species besides Arabidopsis thaliana (as the originating species of the orthologous clustering process). The following is the eudicotyledons category, which also includes additional dicot species like Medicago truncatula, Solanum lycopersicum, Populus trichocarpa, Ricinus communis and others. The next one is the Magnoliophyta, which includes the monocot species like Oryza sativa and Zea mays, and finally the Viridiplantae, which incorporates every species involved in the analysis. We have defined ten taxonomic categories in the chordate section (Table 2). The first one is the Primates, which contains the Homo sapiens as the originating species and all the monkeys. The following is the Euarchontoglires, which contains the Primates and rodent species like the mouse and rat. The next category is the Eutheria with all the placental mammals. The Theria category also includes the marsupials like the opossum, and the mammalian category contains all the mammals including the egg-laying platypus. The subsequent categories are the Amniota (mammals as well as birds and reptiles), the Tetrapoda (Amniota together with the amphibians like the Xenopus species). The Teleostomi category also covers the fishes like the fugu, zebrafish or Tetraodon nigroviridis. We also have a category for the Vertebrate and Chordate taxonomical groups, but due to the drawbacks of our orthologous finding algorithm , they contain only a small number of conserved motifs and are practically unusable.
Searching in the conserved motif collections and promoter sequences
As a result of the orthologous promoter analysis, several motif collections were defined, which consist of 6–50 base pair long consensus sequences. We developed a simple new program called MOFEXT (MOtiF sEarch and EXTension) to perform fast gapless searches in the motif collections containing even more than a million of consensus sequences.
As input, the MOFEXT program requires one or more query sequences and one or more motif lists. The program slides a given length of window specified by the user, over the query sequence, and starts to compare it with the first consensus motif from the motif collections. Sliding the window over the consensus motif, the program calculates a score for each position using a predefined scoring matrix. The overall score is then calculated by adding the individual scores of each position together. The program also calculates the maximum possible score for each window of the query sequence by comparing it with itself. Using these scores, the program calculates a percentage value for each pair of query and motif sequence window. If this percentage value is larger than the cut-off value, and the length of the query and the consensus motif from the list is longer than the window size, the program tries to extend the match in both directions. This means that the program calculates the score as above for each possible overlapping window with an increasing size between the query and consensus sequence, and keeps the one with the highest score. The MOFEXT program accepts the standard IUPAC nucleotide.
For the search in the sequences of individual promoter regions, the FUZZNUC program was used from the European Molecular Biology Open Software Suite (EMBOSS) .
Developing a Perl API for handling and analysing the DoOP data
In order to allow simple access of the data available in the DoOP database we have developed a set of Perl modules called Bio::DOOP. These modules are following the style of the BIOPERL modules , and used mainly for querying the MySQL backend of the DoOP database. The Bio::DOOP modules are available from the CPAN Perl repository http://search.cpan.org/~tibi/ with a complete documentation. They are suitable for using in individual PERL scripts for querying the DoOP database through the net, and for building more complex promoter analysis tools. It is possible for example to simply download species-specific promoter sequences, determine the positions of conserved motifs, or to download the conserved motif sequences. The Bio::DOOP Perl modules also run and parse the results of the MOFEXT program, the FUZZNUC program from the EMBOSS package , and the GeneMerge Perl script .
Construction of the DoOPSearch tool
The DoOPSearch tool is running on a two processor Linux server. The forms are generated using PHP scripts. The programs are called and their results processed using Perl scripts. Most of the data, including the search results, are stored in a MySQL database. The reading and writing of the MySQL database is carried out with the Bio::DOOP Perl modules. The standalone MOFEXT program was written in standard C language. It is available upon request as source code or in precompiled binaries for different platforms. A typical MOFEXT or FUZZNUC search with default parameters ranges in about 10–20 seconds. Depending on the input data and search parameters, however, running a query and processing the results can take several minutes or up to an hour. The database design is basically the same as in the DoOP database, with an additional motif search layer, which can't be implemented with MySQL queries , and provides a simple and easy to use interface for searching, browsing and analysing the results .
The DoOPSearch tool provides an easy to use interface for searching in the promoter data provided by the chordate and plant sections of the DoOP database. The searches result in lists of genes, containing similar sequences to the query in their promoter regions. These gene lists can be further analysed with the GeneMerge program to discover statistically overrepresented Gene Ontology terms . The first step of the workflow is similar to that of the DoOP database. After choosing the chordate or plant section, the user can enter different queries and parameters for searching in the conserved motif collections and promoter sequences. The query types and parameters together with the results returned will be described in detail in the following sections.
Searching in the conserved motif collections
The MOFEXT search is suitable for searching in the different conserved motif collections. The input sequence can be a consensus sequence motif or part of a promoter/UTR sequence. The purpose of this type of search is to find genes sharing the same putative TFBSs or promoter fragments. Finding a similar sequence in a conserved position can be a strong indication of some kind of biological function. The sensitivity and specificity of the search depends on the parameters used. The word size parameter heavily influences the sensitivity and speed of the initial search. It is worth considering using a large word size, or even the whole length of the motif if the query is a known TFBS, and set a rather low cut-off value. In the case of smaller word sizes, the number of hits will increase dramatically, as the MOFEXT program applies the cut-off value prior to extending the hit motif sequence and the results will contain a large number of similar but small motifs. After testing several alternatives the EDNAFULL scoring matrix was chosen from the EMBOSS package . The consensus sequences from the DoOP database are post-processed and divided into different sub-collections of motifs. The DoOP database for example, contains orthologous promoters with three different sizes (500, 1000 or 3000 base pair length), and so the motifs fall into three different category according to the length of the originating promoters. It is also possible to pick motif collections with the consensus generated using promoter sequences only from a strictly defined monophyletic taxonomic group (described in the Methods section). Users can combine the available motif collections, but in the case of larger collections, the running time increases significantly.
The results include the cluster id of the promoter cluster containing the motif similar to the query, a short description, the type and size of the cluster, and the score of the motif. The results are sorted by the MOFEXT scores, the cluster containing the motif with the highest score coming first. It is also possible to sort the results by cluster id or filter them by the score value, cluster type or size. Furthermore, the sequences can be downloaded from here in FASTA format, and a GeneMerge analysis (described later) can be launched.
Searching in the full promoter regions
The FUZZNUC search tool utilizes the FUZZNUC program of the European Molecular Biology Open Software Suite (EMBOSS). The users can choose between the 500, 1000 and 3000 base pair promoter sets. There are also options to search only in the promoter sequences of a given reference organism (Homo sapiens in the case of chordates and Arabidopsis thaliana in the case of plants), or every sequence from different species available in each cluster of promoters. The result page is similar to the MOFEXT result page with the same options, such as observing a given promoter cluster, sorting or filtering the results, downloading the sequences or launching a GeneMerge analysis.
Gene Ontology analysis of the gene lists
If we use a single TFBS as the query in a MOFEXT or FUZZNUC search, we can assume that the genes containing the same putative TFBS in their promoter regions are under similar transcriptional control, and thus their function can also be similar. Based on this approach the gene lists can be analysed in a way similar to those coming from microarray experiments. Since most of these genes are associated with one or more Gene Ontology (GO) term , we can test the gene lists with different statistical methods to find over-represented GO categories. We have chosen the GeneMerge program  to do this on our DoOPSearch website. As the original GeneMerge program proved to be quite slow, we slightly modified it, to make it usable in our DoOPSearch tool.
The result of the statistical analysis greatly depends on the size of the input lists. In our case, both the population, and the study size can vary between different runs. The population size depends on the motif collection subsets or promoter sequence collections used in the original MOFEXT or FUZZNUC search. The study size can be altered by selecting and using different number of genes on the MOFEXT or FUZZNUC search result page. Since the results of the GeneMerge runs (especially the significance of the evaluated GO terms) can vary considerably by changing the initial MOFEXT and FUZZNUC search parameters and/or the number of genes fed to the GeneMerge program, it is worth trying several runs and several filtering parameters to find the best result.
Searching for evolutionary conserved motifs
The first motif collections of the DoOP database were generated using all the available sequences of a given promoter cluster. We noticed however, that using promoter sequences from smaller taxonomic groups improves the alignments significantly and thus can yield different conserved motifs. The evolutionary sense of this approach is the following. If for example a new TFBS had arisen in the placental mammals (Eutheria), it would appear as conserved motif only in an alignment made with only from the Eutherian sequences. In the other alignments, which in turn contain evolutionary more distant species, it can be a non-conserved and so unnoticed region. It must be also mentioned that different orthologous promoter clusters can have different number of species, which in turn can affect the determination of conserved motifs in our algorithm. We defined the taxonomic categories based on this logic. To our knowledge there are no similar motif collections available, taking into account the evolutionary distance between the species used in the phylogenetic footprinting.
Basically two types of analysis can be carried out with the DoOPSearch tool. Either a longer promoter fragment, or a short conserved motif, known TFBS can be used as a query in the MOFEXT or FUZZNUC search.
First we demonstrate how a longer sequence can be used for promoter analysis (Figure 2). Earlier we determined both with experimental and in silico analysis tools, a promoter element in the upstream region of the matrilin-1 gene . In this example we used a 300 base pair upstream fragment starting from the ATG start codon of the human matrilin-1 gene available at the DoOP database . As a control, we chose the FABP4 gene and used the same promoter region. After the MOFEXT search with exactly the same parameters, we filtered the result and ran the GeneMerge analysis. It is clear that there are specific Gene Ontology categories overrepresented in each example. It is also obvious that some categories like "transcription" or "transcription factor activity" can be found in both results. The explanation for this can be that the transcription factors contain more conserved motifs in their promoter regions than other type of genes, but to confirm this, we need to perform other analyses.
In the other case study (Figure 3) we used the binding site of the NF-kappa B gene . As a control we used the (not reverse) complement of the NF-kappa B binding site. The results clearly show the strength of our approach. Although there are some significantly enriched GO terms at the fake site as well, the difference proves the value of our analysis.
These examples show how can the DoOPSearch tool help the in silico annotation of longer promoter regions, known TFBSs or conserved motifs with unknown functions. It must be mentioned however, that to reach these result we had to try different parameters both in the MOFEXT search (word size, cut-off value and motif collection), and also in the GeneMerge analysis.
The DoOPSearch tool in combination with the Bio::DoOP Perl modules and the MOFEXT program are suitable to search for conserved promoter motifs or sequence patterns. Although there are a number of tools and services available for computational analysis of transcriptional regulation using different methods, to our knowledge DoOPSearch is the first tool which gives the opportunity to search in chordate or plant conserved motif collections. To perform such a search, we have developed a program called MOFEXT which utilizes a simple gapless word matching and scoring algorithm in order to search in a collections of conserved motifs, like our motif collections from the DoOP database.
DoOPSearch is an ideal bioinformatics tool for researchers looking for potentially co-expressed genes, or putative TFBSs in the upstream regions of different genes. Using the MOFEXT search it is relatively easy to find genes containing conserved motifs in their promoter regions similar to the query. The query can either be a short experimentally verified binding site or a promoter region with unknown TFBSs but proven regulatory functions. By recognizing that the results from the MOFEXT or FUZZNUC searches might contain co-expressed genes similar to the results of microarray experiments, we offer a unique tool to assign Gene Ontology terms and functions to a motif. The method cerainly contains a lot of ambiguity and might give false positive results, but after a detailed analysis it might be a good indicator of the role of a TFBS in gene regulation. We believe that using improved multiple alignments (with more species for example) and using more accurately defined promoter regions, our method could be improved significantly and help in the in silico annotation of TFBSs.
We constantly develop and update the data and the web interface. Due to the growing number of genome annotations and available genome sequences, the quality of the conserved motif collections will improve considerably. We also: (I) try to refine and improve the existing motif collections; (II) define and use new conserved motif collections; (III) implement new filtering functions, allowing the user to define a subset of genes (for example co-regulated genes from microarray experiments) used in the search; (IV) transfer the server to faster hardware and improve the speed of the MOFEXT search by preliminary runs, comparing the available motifs with each other using default parameters.
Wingender E: The TRANSFAC project as an example of framework technology that supports the analysis of genomic regulation. Brief Bioinform 2008, 9: 326–332. 10.1093/bib/bbn016
Bryne JC, Valen E, Tang MH, Marstrand T, Winther O, da Piedade I, Krogh A, Lenhard B, Sandelin A: JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res 2008, 36: D102–106. 10.1093/nar/gkm955
Bucher P: Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J Mol Biol 1990, 212: 563–578. 10.1016/0022-2836(90)90223-9
Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, et al.: Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 2005, 23: 137–144. 10.1038/nbt1053
Rombauts S, Florquin K, Lescot M, Marchal K, Rouze P, Peer Y: Computational approaches to identify promoters and cis-regulatory elements in plant genomes. Plant Physiol 2003, 132: 1162–1176. 10.1104/pp.102.017715
Blanchette M, Tompa M: Discovery of regulatory elements by a computational method for phylogenetic footprinting. Genome Res 2002, 12: 739–748. 10.1101/gr.6902
Thomas JW, Touchman JW, Blakesley RW, Bouffard GG, Beckstrom-Sternberg SM, Margulies EH, Blanchette M, Siepel AC, Thomas PJ, McDowell JC, et al.: Comparative analyses of multi-species sequences from targeted genomic regions. Nature 2003, 424: 788–793. 10.1038/nature01858
Flicek P, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, et al.: Ensembl 2008. Nucleic Acids Res 2008, 36: D707–714. 10.1093/nar/gkm988
Karolchik D, Kuhn RM, Baertsch R, Barber GP, Clawson H, Diekhans M, Giardine B, Harte RA, Hinrichs AS, Hsu F, et al.: The UCSC Genome Browser Database: 2008 update. Nucleic Acids Res 2008, 36: D773–779. 10.1093/nar/gkm966
Palaniswamy SK, Jin VX, Sun H, Davuluri RV: OMGProm: a database of orthologous mammalian gene promoters. Bioinformatics 2005, 21: 835–836. 10.1093/bioinformatics/bti119
Barta E, Sebestyen E, Palfy TB, Toth G, Ortutay CP, Patthy L: DoOP: Databases of Orthologous Promoters, collections of clusters of orthologous upstream sequences from chordates and plants. Nucleic Acids Res 2005, 33: D86–90. 10.1093/nar/gki097
Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M: Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature 2005, 434: 338–345. 10.1038/nature03441
Robertson G, Bilenky M, Lin K, He A, Yuen W, Dagpinar M, Varhol R, Teague K, Griffith OL, Zhang X, et al.: cisRED: a database system for genome-scale computational discovery of regulatory elements. Nucleic Acids Res 2006, 34: D68–73. 10.1093/nar/gkj075
Dieterich C, Grossmann S, Tanzer A, Ropcke S, Arndt PF, Stadler PF, Vingron M: Comparative promoter region analysis powered by CORG. BMC Genomics 2005, 6: 24. 10.1186/1471-2164-6-24
Barta E: Comparative genomics-based orthologous promoter analysis using the DoOP database and the DoOPSearch web tool. Methods Mol Biol 2007, 395: 319–328.
Morgenstern B: DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 1999, 15: 211–218. 10.1093/bioinformatics/15.3.211
Ward LD, Bussemaker HJ: Predicting functional transcription factor binding through alignment-free and affinity-based analysis of orthologous promoter sequences. Bioinformatics 2008, 24: i165–171. 10.1093/bioinformatics/btn154
Hooghe B, Hulpiau P, van Roy F, De Bleser P: ConTra: a promoter alignment analysis tool for identification of transcription factor binding sites across species. Nucleic Acids Res 2008, 36: W128–132. 10.1093/nar/gkn195
Gotea V, Ovcharenko I: DiRE: identifying distant regulatory elements of co-expressed genes. Nucleic Acids Res 2008, 36: W133–139. 10.1093/nar/gkn300
Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 2000, 16: 276–277. 10.1016/S0168-9525(00)02024-2
Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, et al.: The Bioperl toolkit: Perl modules for the life sciences. Genome Res 2002, 12: 1611–1618. 10.1101/gr.361602
Castillo-Davis CI, Hartl DL: GeneMerge – post-genomic analysis, data mining, and hypothesis testing. Bioinformatics 2003, 19: 891–892. 10.1093/bioinformatics/btg114
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25: 25–29. 10.1038/75556
Rentsendorj O, Nagy A, Sinko I, Daraba A, Barta E, Kiss I: Highly conserved proximal promoter element harbouring paired Sox9-binding sites contributes to the tissue- and developmental stage-specific activity of the matrilin-1 gene. Biochem J 2005, 389: 705–716. 10.1042/BJ20050214
Kunsch C, Ruben SM, Rosen CA: Selection of optimal kappa B/Rel DNA-binding motifs: interaction of both subunits of NF-kappa B with DNA is required for transcriptional activation. Mol Cell Biol 1992, 12: 4412–4421.
This research was supported by grants from the Hungarian Scientific Research Fund (OTKA) (NK60352 and NK72730) and from the Hungarian Ministry of Agriculture and Rural Developmen R&D grant. E. Barta was supported by the Bolyai János Scholarship. We thank Dr Gábor Tóth and Tamás Pálfy for their contribution in developing the original DoOP databases and for providing their scripts for using in this work.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 6, 2009: European Molecular Biology Network (EMBnet) Conference 2008: 20th Anniversary Celebration. Leading applications and technologies in bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S6.
The authors declare that they have no competing interests.
ES generated the motif subsets, modified the GeneMerge script, designed the website and helped drafting the manuscript. TN wrote the MOFEXT program and the Bio::DOOP Perl modules. EB initiated and coordinated the project and wrote the manuscript. ES, TN, EB and SS were all involved in the discussion of the data analysis and assisted in writing the manuscript. All authors read and approved the final manuscript.
Endre Sebestyén, Tibor Nagy contributed equally to this work.