MycoperonDB: a database of computationally identified operons and transcriptional units in Mycobacteria
© Ranjan et al 2006
Published: 18 December 2006
A key post-genomics challenge is to identify how genes in an organism come together to perform physiological functions. An important first step in this direction is to identify transcriptional units, operons and regulons in a genome. Here we implement and report a strategy to computationally identify transcriptional units and operons of mycobacteria and construct a database, MycoperonDB.
We have predicted transcriptional units and operons in mycobacteria and organized these predictions in the form of a relational database called MycoperonDB. MycoperonDB at present contains 18053 genes organized as 8256 predicted operons and transcriptional units from five closely related species of mycobacteria. The database further provides literature links for experimentally characterized operons. All known promoters and related information are collected, analysed and stored. A user-friendly interface allows web-based navigation of transcription units and operons, and provides search tools to locate transcription factor binding DNA motifs upstream of genes. The reliability of operon prediction has been assessed by comparing the predicted operons with a set of known operons.
MycoperonDB is a publicly available structured relational database which has information about mycobacterial genes, transcriptional units and operons. We expect this database to help molecular biologists and microbiologists hypothesize functional linkages between operonic genes of mycobacteria, and to support their experimental characterization and validation. The database is freely available from our website http://www.cdfd.org.in/mycoperondb/index.html.
Genome sequencing projects have generated large volumes of biological data which are difficult to manage and integrate effectively. This poses new challenges for biologists, who must now decode the complex physiological information encoded within these huge genomes. A first step in this direction is to know how the various genes are organized as transcription units, operons and regulons within a genome. We have previously reported strategies and tools, such as PredictRegulon and iCR, to identify regulons in bacterial genomes and identified DtxR/IdeR associated regulons in corynebacteria and mycobacteria [1–5]. At present we are interested in developing strategies to identify transcriptional units and operons of mycobacteria.
It is well known that genes belonging to the same operon are transcribed as a single mRNA molecule in all prokaryotes. Transcription starts as the RNA polymerase binds to the promoter and continues until it reaches a transcriptional terminator. The genes of the same operon are believed to be involved in similar metabolic and physiological processes. Hence operon prediction also provides important clues to the functional relationships between the operonic genes, which can then be taken up by the experimental biologist for further validation.
A number of computational and experimental approaches are being used to determine which genes in a genome act together to perform a physiological function. Among experimental approaches, RNase protection assay, dot blot or real-time PCR are generally used to define operon boundaries [6–9], but applying these techniques to all the genes of a genome is expensive. A number of computational methods have been published for operon prediction [10–13], and a number of genome-specific databases are also available that provide genome-wide operon information [14, 15].
Recently a database, ODB, was published which has known and putative operons of many prokaryotic species including mycobacteria. However, many mycobacterial transcriptional units and operons, even some known operons, are missing from this database. Its advanced search option also demands considerable effort and expertise, as well as external information that an average microbiologist may find difficult to provide. Therefore, there is a need to carry out more focused prediction of transcriptional units and operons in a group of related microorganisms. Such predictions and the resulting specialized database are likely to be more useful for a specific research domain than global predictions. A more focused prediction in an organism also allows researchers to revisit the data, track developments regularly and update the database as research in the field progresses. Good examples of such databases are RegulonDB for E. coli, DBTBS for B. subtilis and PlasmoDB for Plasmodia [15, 14, 17].
We present here a mycobacterial database, MycoperonDB, which holds known data related to mycobacterial genes, including gene sequences, encoded protein sequences, known promoters, known and predicted operons and related PubMed links. These data are precomputed so that all information can be accessed quickly. Definitions of the terms used in transcriptomics, as well as one- or two-line descriptions of the important mycobacterial genes, are given as a glossary on the help page. The positions of important motifs can also be searched in this database. We expect the database to be useful for researchers working with mycobacteria. It is an ongoing effort, and coverage will be extended to more mycobacterial species as and when their genome sequences become available. Some of these species include Mycobacterium smegmatis, Mycobacterium w etc. At present, 8256 predicted operons and transcriptional units are reported in 5 mycobacterial genomes: M. tuberculosis H37Rv, M. tuberculosis CDC1551, M. bovis, M. avium and M. leprae.
The overall process of transcription unit and operon prediction involved multiple stages. Perl scripts were written and used at every stage of operon prediction. The stages were as follows.
The complete genome sequences of all species of mycobacteria with original annotations were downloaded from NCBI.
Genes that are part of the same operon must have the same orientation. Accordingly, all adjacent genes with the same orientation were identified and grouped together.
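The orientation step can be illustrated with a minimal Python sketch. This is not the Perl code used for MycoperonDB; the gene tuples, the toy coordinates and the function name `group_by_orientation` are assumptions made only for illustration, and the circular nature of the chromosome is ignored.

```python
from itertools import groupby

# Hypothetical gene records: (locus_tag, start, end, strand).
genes = [
    ("g1", 100, 400, "+"),
    ("g2", 450, 900, "+"),
    ("g3", 950, 1500, "-"),
    ("g4", 1600, 2100, "-"),
    ("g5", 2200, 2600, "+"),
]

def group_by_orientation(genes):
    """Split a gene list into runs of adjacent genes on the same strand."""
    ordered = sorted(genes, key=lambda g: g[1])  # order along the genome
    # groupby only merges consecutive items with an equal key, which is
    # exactly the "adjacent and same orientation" condition.
    return [list(run) for _, run in groupby(ordered, key=lambda g: g[3])]

runs = group_by_orientation(genes)
for run in runs:
    print([g[0] for g in run])
```

Each run produced here is only a candidate grouping; the later criteria (intergenic distance, terminators, conservation) decide where a run is split.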
Genes in the same operon are usually located closer together on the genome than genes in different operons. After orientation, intergenic distance is therefore the next indicator used to identify operons. The intergenic distance between adjacent genes P and Q in the same orientation was calculated from their coordinates using the formula: distancePQ = (start position of gene Q) - (end position of gene P). In general, a gene was assigned to the next operon if its distance from the preceding gene was greater than 300 bp. This cut-off was taken from operon prediction in E. coli.
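As a rough illustration of the distance criterion, the following Python sketch splits one same-orientation run wherever the intergenic distance exceeds 300 bp. The tuple layout, the toy coordinates and the name `split_by_distance` are hypothetical; only the formula and the 300 bp cut-off come from the text.

```python
MAX_GAP_BP = 300  # cut-off stated in the text

def split_by_distance(run, max_gap=MAX_GAP_BP):
    """Split a same-orientation run of genes at large intergenic gaps.

    Each gene is (locus_tag, start, end) with start < end, and the run
    is sorted by start position.
    """
    if not run:
        return []
    units = [[run[0]]]
    for gene_p, gene_q in zip(run, run[1:]):
        # distancePQ = gene Q start - gene P end (negative if genes overlap)
        distance = gene_q[1] - gene_p[2]
        if distance > max_gap:
            units.append([gene_q])    # gene Q starts the next unit
        else:
            units[-1].append(gene_q)  # gene Q joins the current unit
    return units

run = [("a", 100, 400), ("b", 450, 900), ("c", 1300, 1800), ("d", 1790, 2500)]
units = split_by_distance(run)
print([[g[0] for g in unit] for unit in units])
```

In this toy run the gap between b and c is 400 bp, so a new unit starts at c, while the overlapping genes c and d (distance -10 bp) stay together.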
A transcription terminator is a site where transcription ends, so genes flanking a terminator cannot be in the same operon. The Terminator program from the GCG Wisconsin software package was used to identify rho-independent transcriptional terminators, and its output was parsed for hits with S-value > 0. Finally, only those terminators were considered that lay in the region from -20 to +200 nucleotides around the stop codon of a mycobacterial gene of an operon (operon boundary after step 1). Genes with a terminator site at their end were taken as the end of a transcription unit or operon.
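The terminator filter can be sketched in Python as follows. The sketch assumes the terminator predictions have already been parsed into hypothetical (position, S-value) pairs; it does not reproduce the GCG Terminator output format, it ignores the strand of the predicted terminator, and it assumes the genes of a unit are listed in transcription order. The window of -20 to +200 nucleotides and the S-value > 0 filter are taken from the text.

```python
def stop_coordinate(gene):
    """Genome coordinate of the stop codon end of a gene (tag, start, end, strand)."""
    _, start, end, strand = gene
    return end if strand == "+" else start

def has_terminator(gene, terminators, upstream=20, downstream=200):
    """True if a terminator with S-value > 0 lies -20..+200 nt from the stop codon."""
    stop = stop_coordinate(gene)
    sign = 1 if gene[3] == "+" else -1  # downstream direction depends on strand
    for position, s_value in terminators:
        offset = sign * (position - stop)
        if s_value > 0 and -upstream <= offset <= downstream:
            return True
    return False

def split_at_terminators(unit, terminators):
    """Close the current unit after every gene that is followed by a terminator."""
    pieces, current = [], []
    for gene in unit:  # assumed to be in transcription order
        current.append(gene)
        if has_terminator(gene, terminators):
            pieces.append(current)
            current = []
    if current:
        pieces.append(current)
    return pieces

unit = [("a", 100, 400, "+"), ("b", 450, 900, "+"), ("c", 1000, 1500, "+")]
terminators = [(950, 3.2), (1600, -1.0), (420, -0.5)]  # toy (position, S-value) pairs
pieces = split_at_terminators(unit, terminators)
print([[g[0] for g in piece] for piece in pieces])
```

Only the terminator at position 950 passes both filters (S-value > 0 and 50 nt downstream of the stop codon of b), so the unit is closed after gene b.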
Conserved gene clusters among genomes were identified as orthologs either on the basis of gene order or on the basis of clusters of orthologous groups (COGs). Where conserved gene clusters (adjacent genes with the same orientation grouped together in more than one species) were found, the intergenic distance and terminator criteria were relaxed; that is, genes clustered across species were kept in one operon.
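One possible reading of this relaxation step is sketched below in Python: two neighbouring predicted units are merged when the genes at their boundary have orthologs that are also adjacent and on the same strand in a second genome. The ortholog table, the gene tags and the function names are all hypothetical, and the sketch assumes the neighbouring units already share the same orientation. It is meant only to illustrate the idea, not the procedure actually used for MycoperonDB.

```python
def is_conserved_pair(gene_p, gene_q, ortholog_of, other_genome):
    """True if the orthologs of P and Q are adjacent and co-oriented elsewhere.

    ortholog_of maps a locus tag to its ortholog in the other genome;
    other_genome is an ordered list of (locus_tag, strand).
    """
    index = {tag: i for i, (tag, _) in enumerate(other_genome)}
    orth_p, orth_q = ortholog_of.get(gene_p), ortholog_of.get(gene_q)
    if orth_p not in index or orth_q not in index:
        return False
    i, j = index[orth_p], index[orth_q]
    return abs(i - j) == 1 and other_genome[i][1] == other_genome[j][1]

def merge_conserved(units, ortholog_of, other_genome):
    """Merge neighbouring units whose boundary gene pair is conserved."""
    if not units:
        return []
    merged = [list(units[0])]
    for unit in units[1:]:
        if is_conserved_pair(merged[-1][-1], unit[0], ortholog_of, other_genome):
            merged[-1].extend(unit)  # conservation overrides the earlier split
        else:
            merged.append(list(unit))
    return merged

units = [["a", "b"], ["c"], ["d"]]
ortholog_of = {"b": "B", "c": "C", "d": "D"}
other_genome = [("B", "+"), ("C", "+"), ("X", "+"), ("D", "+")]
merged = merge_conserved(units, ortholog_of, other_genome)
print(merged)
```

Here b and c are rejoined because their orthologs B and C are neighbours in the second genome, whereas c and d stay separate because an unrelated gene lies between C and D.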
We scanned the mycobacterial literature for reports on known transcription units, operons, promoters, and transcription start points of individual mycobacterial genes. The PubMed IDs of the identified articles were integrated with our computational predictions, for easy and quick browsing of the articles that give detailed information on promoter and operon characterization. Where promoter information was published for one species of mycobacteria, the homologous sequences in other species were searched computationally. The search results were also incorporated in the table with the same PubMed ID.
We structured our data in the form of a database. A relational database, MycoperonDB, was constructed using the MySQL database management system (DBMS) to store and manage all information. MycoperonDB is currently composed of 6 tables. At present the database has information only for those mycobacterial species whose genomes are published and available at NCBI, but the same method can be used to extend the database to other genomes.
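To give a feel for how operon membership can be stored relationally, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The two tables, their column names and the toy rows are assumptions for illustration only; the actual six MycoperonDB tables are not described in the text and are not reproduced here.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE operon (
    operon_id INTEGER PRIMARY KEY,
    genome    TEXT NOT NULL
);
CREATE TABLE gene (
    locus_tag TEXT PRIMARY KEY,
    operon_id INTEGER NOT NULL REFERENCES operon(operon_id),
    start_bp  INTEGER NOT NULL,
    end_bp    INTEGER NOT NULL,
    strand    TEXT NOT NULL
);
""")

# Toy rows with hypothetical locus tags.
conn.execute("INSERT INTO operon VALUES (1, 'toy_genome')")
conn.execute("INSERT INTO operon VALUES (2, 'toy_genome')")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?, ?)",
    [("g1", 1, 100, 400, "+"), ("g2", 1, 450, 900, "+"), ("g3", 2, 1300, 1800, "+")],
)

def operon_members(conn, locus_tag):
    """Return all genes that share a predicted operon with the given gene."""
    rows = conn.execute(
        "SELECT g2.locus_tag FROM gene AS g1 "
        "JOIN gene AS g2 ON g1.operon_id = g2.operon_id "
        "WHERE g1.locus_tag = ? ORDER BY g2.start_bp",
        (locus_tag,),
    ).fetchall()
    return [row[0] for row in rows]

print(operon_members(conn, "g2"))
```

Keeping one row per gene with a foreign key to its operon makes the "which operon does this gene belong to" lookup a single self-join.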
MycoperonDB aims to provide researchers interested in mycobacteria with a quick overview of the operon and transcription unit organization of a given gene, together with related literature information such as the positions of promoters and transcription start points (TSPs), PubMed links, sequences of individual genes, and definitions of most of the terms used for mycobacterial gene regulatory circuits. A help page is also provided to guide users step by step through the database.
The user can type any motif of interest in the search box, and MycoperonDB returns the positions of that motif in the whole genome. The search can be run either in one species or in several species to find homologs of the motif across species. If the position of the motif does not fall in the upstream region (-500 bases) of any gene, the result page reports no operon context.
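The motif lookup described above can be approximated by the following Python sketch. It performs exact string matching on the forward strand only and uses 1-based coordinates; these choices, the toy sequence, the gene tuples and the function names are assumptions, since the text does not describe how the web interface implements the search. The 500-base upstream window is taken from the text.

```python
def find_motif(sequence, motif):
    """Return 1-based start positions of all exact matches (forward strand only)."""
    hits = []
    i = sequence.find(motif)
    while i != -1:
        hits.append(i + 1)
        i = sequence.find(motif, i + 1)  # step by one to allow overlapping hits
    return hits

def upstream_genes(hit, genes, window=500):
    """Return tags of genes whose upstream region (window bases) contains the hit.

    Each gene is (locus_tag, start, end, strand) with start < end.
    """
    found = []
    for tag, start, end, strand in genes:
        if strand == "+" and start - window <= hit < start:
            found.append(tag)  # upstream of a forward gene: lower coordinates
        elif strand == "-" and end < hit <= end + window:
            found.append(tag)  # upstream of a reverse gene: higher coordinates
    return found

genome = "A" * 100 + "TTGACA" + "A" * 894  # toy 1000 bp sequence
genes = [("g1", 201, 500, "+"), ("g2", 700, 900, "-")]

for hit in find_motif(genome, "TTGACA"):
    context = upstream_genes(hit, genes)
    print(hit, context if context else "no operon context")
```

A hit that lies inside a gene or more than 500 bases from any gene start yields an empty list, which corresponds to the "no operon context" message.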
We searched the literature extensively for known mycobacterial operons to assess how far the predictions deviate from the actual operons. In most cases the predictions agreed with the experimental observations. For example, the mceI operon has been shown to be transcribed as a 13-gene polycistronic message in M. tuberculosis, which is in agreement with our prediction: in our H37Rv operon table, Rv0166 to Rv0178 are together. A virulence operon has been reported in M. tuberculosis, and when checked in our operon table, all three genes of this operon, Rv0986 to Rv0988, were found to be together. Similarly, there are a number of other examples, such as the embCAB operon, ini operon, mymA operon and kasA operon [25–27], for which our predictions were found to be correct.
In a few cases, such as the nat operon reported in M. bovis and the devR and ent operons reported in M. tuberculosis [29, 30], our prediction includes a few more genes than reported, which needs to be checked experimentally.
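The kind of comparison described here, between predicted operons and experimentally known ones, can be sketched in Python as follows. The three outcome labels, the function name `classify` and the toy gene tags are hypothetical and are not taken from the MycoperonDB evaluation; the "extended" label corresponds to the situation above in which a prediction contains more genes than the reported operon.

```python
def classify(known_operon, predicted_operons):
    """Compare one known operon against a list of predicted operons."""
    known = set(known_operon)
    for predicted in predicted_operons:
        predicted = set(predicted)
        if predicted == known:
            return "exact"      # same gene set
        if known < predicted:
            return "extended"   # prediction holds all known genes plus extras
    return "split_or_missed"    # known genes are not together in any prediction

# Toy data with hypothetical gene tags.
predicted_operons = [["g1", "g2", "g3"], ["g4", "g5", "g6", "g7"], ["g8"], ["g9"]]
known_operons = {
    "opA": ["g1", "g2", "g3"],
    "opB": ["g5", "g6"],
    "opC": ["g8", "g9"],
}

results = {name: classify(genes, predicted_operons)
           for name, genes in known_operons.items()}
for name, outcome in results.items():
    print(name, outcome)
```

Counting the outcomes over a curated list of known operons gives a simple summary of how often the prediction agrees with, extends, or splits the experimentally defined units.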
We have predicted transcriptional units and operons in mycobacteria and organized these predictions in the form of a relational database called MycoperonDB. We further provide additional information about known and experimentally demonstrated operons, promoters and their literature links. The strengths of this database are its simplicity, its free web accessibility, its specificity, its comprehensiveness for published mycobacterial genomes and its interactive graphical interface. This database is part of our broad effort to characterize regulons, operons and transcriptional units in mycobacteria. It can be a practical aid in dealing with the complexity of mycobacterial genomes, and it is expected to assist molecular biologists as well as microbiologists working with mycobacteria.
Research in AR's laboratory is supported by grants from the Department of Biotechnology, Council of Scientific & Industrial Research (CSIR), Govt. of India. SR is supported by a Postdoctoral Research Fellowship from the Department of Biotechnology, Govt. of India.
This article has been published as part of BMC Bioinformatics Volume 7, Supplement 5, 2006: APBioNet – Fifth International Conference on Bioinformatics (InCoB2006). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/7?issue=S5
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.