Orthology-driven mapping of bidirectional promoters in human and mouse genomes
© Yang and Elnitski; licensee BioMed Central Ltd. 2014
Published: 16 December 2014
The presence of bidirectional promoters in all vertebrate species suggests that these promoters may be maintained in orthologous positions. A comprehensive orthologous mapping of this type of promoter across species can therefore facilitate elucidation of the regulatory mechanisms controlling bidirectional gene expression. However, the lack of annotation for many transcribed regions in the genome can impact the orthology designation of these promoters. The human and mouse genomes are among those that have been relatively well annotated, so we used them as models to study the orthologous patterns of bidirectional promoters.
We developed a method to annotate these regulatory regions by confirming the orthology of the genes found on each side of the promoters. In this manuscript we report the cross-species comparisons between human and mouse genomes, where the bidirectional promoter sets regulating UCSC Known Genes and spliced EST annotations were mapped from human to mouse and vice versa. We validate hundreds of orthologous bidirectional promoters through the presence of orthologous flanking gene annotations in the second species. We also show that regulatory activity of these orthologous promoters confers similar gene expression profiles in 21 tissues of human and mouse. In particular, more than one third of human bidirectional promoters annotated from spliced EST annotations regulate ncRNAs, of which over 90% are lncRNAs.
Although evolutionary conservation shows a weaker signature in promoters than in coding regions, our technique of mapping orthologous genes shows that most bidirectional promoter arrangements are conserved across the human and mouse genomes, suggesting a critical function. In addition, the similar expression patterns of the orthologous gene sets indicate that the regulatory mechanisms remain largely conserved as well.
Bidirectional promoters are the regulatory regions that fall between pairs of genes, where the 5' ends of the genes within a pair are positioned in close proximity to one another. This spacing facilitates the initiation of transcription of both genes, creating two transcription forks that advance in opposite directions. The formal definition of a bidirectional promoter requires that the two transcription initiation sites be separated by no more than 1,000 bp. Using these criteria we have comprehensively annotated the human and mouse genomes for the presence of bidirectional promoters, using in silico approaches [1, 2]. The identification of these promoters is contingent upon the presence of adjacent, oppositely oriented pairs of genes, whose orthology assignments are quantitatively stronger than those of noncoding regions. This approach allows us to uniquely identify bidirectional promoters de novo [1, 3] and does not require tissue-specific epigenetic data, which cannot be easily compared across tissues of different species. Genomic annotations used for our identification phase include (1) curated protein-coding gene annotations, (2) spliced ESTs (spESTs) and (3) 5' "end-capped" transcript data, e.g., the Cap-Analysis of Gene Expression (CAGE) database. The annotations for protein-coding genes are robust and therefore provide a high-quality dataset for mapping bidirectional promoters. In contrast, bidirectional promoters supported by RNA evidence alone (as in (2)) have varying levels of evidence, ranging from one characterized transcript to hundreds of them. For this reason, dataset (3), the CAGE data, provides a stringent level of validation for the start sites of the EST transcripts. As a large class of regulatory sequences, bidirectional promoters exemplify a rich source of unexplored biological information in the human genome.
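The definition above can be illustrated with a minimal Python sketch. It is not the authors' annotation pipeline: the `Gene` record, the toy coordinates and the gene names are hypothetical, and the sketch does not check that no other gene lies between the two partners. It only tests the stated criteria, namely opposite strands, a head-to-head arrangement and transcription start sites no more than 1,000 bp apart.

```python
from dataclasses import dataclass
from typing import List, Tuple

MAX_TSS_DISTANCE = 1000  # bp, the cut-off stated in the text


@dataclass(frozen=True)
class Gene:
    # Hypothetical minimal gene record, for illustration only.
    name: str
    chrom: str
    start: int   # leftmost coordinate
    end: int     # rightmost coordinate
    strand: str  # "+" or "-"

    @property
    def tss(self) -> int:
        # The 5' end is the left edge on "+" and the right edge on "-".
        return self.start if self.strand == "+" else self.end


def bidirectional_pairs(genes: List[Gene]) -> List[Tuple[Gene, Gene]]:
    """Return (minus-strand gene, plus-strand gene) head-to-head pairs."""
    minus = [g for g in genes if g.strand == "-"]
    plus = [g for g in genes if g.strand == "+"]
    pairs = []
    for m in minus:
        for p in plus:
            if m.chrom != p.chrom:
                continue
            gap = p.tss - m.tss
            # Divergent arrangement: the minus-strand TSS lies to the left
            # of the plus-strand TSS, within the distance cut-off.
            if 0 <= gap <= MAX_TSS_DISTANCE:
                pairs.append((m, p))
    return pairs


# Toy annotations (made-up names and coordinates).
toy = [
    Gene("A", "chr1", 1000, 5000, "-"),
    Gene("B", "chr1", 5400, 9000, "+"),
    Gene("C", "chr1", 20000, 25000, "+"),
    Gene("D", "chr2", 100, 900, "-"),
]
pairs = bidirectional_pairs(toy)
print([(m.name, p.name) for m, p in pairs])
```

In this toy set only genes A and B qualify, because their start sites lie 400 bp apart on opposite strands; the all-against-all comparison is chosen for clarity rather than speed.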
Here, we show that when compared to the mouse genome, these promoters are identifiable as truly orthologous locations, being maintained in regions of conserved synteny (including both genes and the intervening promoter region) that have undergone no rearrangements since the last common ancestor of humans and mice, 75 million years ago. These analyses represent a unique approach to identifying orthologous promoter regions with a high level of certainty.
Bidirectional promoter identification
Bidirectional gene pairs in human protein-coding genes
Bidirectional promoters identified by spliced ESTs
A richer source of bidirectional promoter evidence was present in the spliced EST data. This dataset was far larger than the protein-coding gene set, requiring 7 million transcripts to be condensed into unique, non-overlapping loci. The complexity of these data required a stringent approach to classifying potential bidirectional promoters to avoid false positive predictions. We developed and implemented a rigorous mapping procedure to identify such promoters. Using the spEST data from the UCSC Genome Browser (requiring at least one canonical intron) we detected 2,939 additional bidirectional promoters not detectable via the protein-coding gene annotations.
When the transcription start sites of these bidirectional transcripts were compared to the CAGE transcripts, they showed a pattern similar to that of the Known Genes data (Figure 1B). Furthermore, using the CAGE data, we found that 66% of the genes in bidirectional gene pairs exhibited coordinated transcriptional activation, having a CAGE tag at the left and right TSS in the tissues examined.
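As a rough illustration of how such a coordination rate could be computed, the following Python sketch counts gene pairs that have a CAGE tag near both start sites in at least one common tissue. The 50 bp matching window, the tissue names and all coordinates are assumptions made for the example and do not come from the study; the sketch also assumes all positions lie on a single chromosome.

```python
from typing import Dict, List, Tuple

WINDOW = 50  # bp tolerance around a TSS; an assumed value, not from the text


def has_tag(tss: int, tags: List[int]) -> bool:
    """True if any CAGE tag lies within WINDOW bp of the TSS."""
    return any(abs(tag - tss) <= WINDOW for tag in tags)


def coordinated_fraction(
    pairs: List[Tuple[int, int]], cage: Dict[str, List[int]]
) -> float:
    """Fraction of (left TSS, right TSS) pairs tagged at both ends
    within the same tissue."""
    if not pairs:
        return 0.0
    coordinated = 0
    for left, right in pairs:
        if any(has_tag(left, tags) and has_tag(right, tags)
               for tags in cage.values()):
            coordinated += 1
    return coordinated / len(pairs)


# Toy data: three gene pairs and CAGE tag positions in two tissues.
toy_pairs = [(1000, 1400), (5000, 5600), (9000, 9300)]
toy_cage = {"liver": [1010, 1390, 5005], "brain": [5590, 9000]}
fraction = coordinated_fraction(toy_pairs, toy_cage)
print(f"{fraction:.2f}")
```

Only the first toy pair is tagged at both start sites in the same tissue. The second pair has one tag in each tissue, so it does not count as coordinated under this definition.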
Bidirectional promoter annotation in the mouse genome
Orthologous bidirectional promoter identification
Assigning orthologous regions
As coding regions have a stronger orthologous alignment signal than other genomic regions, we used the orthology of adjacent gene pairs as anchors to assess the ancestral relatedness of the intervening bidirectional promoters. The orthology of genes was determined using chain and net data from the UCSC Genome Browser. Chains in the Genome Browser represent sequences of gapless aligned blocks. Nets provide a hierarchical ordering of those chains. Level 1 chains contain the longest, best-scoring sequence chains that span any selected region. Subsequent levels in the net represent the results of rearrangements, duplications, insertions and deletions that may have disrupted the presence of conserved synteny derived from an ancestral sequence.
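The use of level 1 net entries as anchors can be sketched as follows. The `NetFill` record is a simplified, hypothetical stand-in for a row of net data and does not reproduce the actual UCSC table schema; the coordinates are invented. The sketch only asks whether a gene interval lies entirely inside a level 1 entry and, if so, reports where that entry aligns in the second species.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NetFill:
    # Simplified, hypothetical record; not the real UCSC net table layout.
    level: int
    chrom: str
    start: int
    end: int
    other_chrom: str  # chromosome of the aligned region in the second species


def level1_region(
    chrom: str, start: int, end: int, fills: List[NetFill]
) -> Optional[NetFill]:
    """Return the level 1 entry that fully contains the interval, if any."""
    for fill in fills:
        if (fill.level == 1 and fill.chrom == chrom
                and fill.start <= start and end <= fill.end):
            return fill
    return None


# Toy net: one long level 1 entry and a nested lower-ranked entry.
toy_fills = [
    NetFill(1, "chr1", 0, 100_000, "chr4"),
    NetFill(3, "chr1", 2_000, 3_000, "chr9"),
]
hit = level1_region("chr1", 2_100, 2_900, toy_fills)
print(hit.other_chrom if hit else "no level 1 alignment")
```

The gene at chr1:2,100-2,900 is also covered by the nested level 3 entry, but only the level 1 entry is considered, so the reported region is the one on chr4.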
Confirming orthologous genes
After determining the orthology assignments using the UCSC chains and nets data, we used the Known Gene annotations or spliced ESTs to look up the identity of genes within the corresponding region. Known Genes represent protein-coding genes and therefore orthology can be verified by chains and nets alignments, followed by confirmation of protein identity in both species. Spliced ESTs carry less descriptive information than protein-coding genes and therefore cross-species comparisons require their presence in an orthologous position, showing conserved synteny of two transcripts forming a divergent pair and meeting the criterion of less than 1,000 bp of intergenic distance between those transcripts. Our method for mapping bidirectional promoters in the spliced EST datasets is described in more detail in a previous publication. When our program verified evidence for orthology and conserved-syntenic gene arrangement, the orthologous bidirectional promoter was confirmed. After orthologous assignments were confirmed in mouse for pairs of human genes, the reciprocal assignments were analyzed from mouse back to human.
Orthology mapping of bidirectional promoters from human to mouse
Within a species, annotated transcripts provide critical evidence for identifying bidirectional promoters. Across species, over 90% of the human and mouse genomes can be partitioned into corresponding regions of conserved synteny. We hypothesized that conserved synteny of bidirectional gene pairs predicts the presence of orthologous bidirectional promoters. Thus, missing annotations at the 5' ends of genes could be predicted from comparisons to a second species. We developed a methodology to examine orthologous locations of the pairs of genes and their intergenic promoter regions in a second species as a method of prediction, discovery and validation.
- the human gene has an ortholog in mouse and that ortholog has a bidirectional partner within 1,000 bp that is an ortholog of the gene partner in human
- the human gene has an ortholog in mouse and that ortholog has a bidirectional partner within 1,000 bp that is not an ortholog of the gene partner in human
- the human gene has an ortholog in mouse and that ortholog is missing a bidirectional partner within 1,000 bp
- a non-orthologous gene was mapped to the corresponding mouse location
- no orthology was recorded in the mouse genome
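The five outcomes listed above can be expressed as a small decision procedure. The Python sketch below is an illustration under stated assumptions rather than the authors' program: the three lookup tables (`orthologs`, `region_hit`, `mouse_partners`) are hypothetical inputs that would have to be derived from the alignments and annotations, and the gene names are placeholders.

```python
from enum import Enum
from typing import Dict, Optional


class Outcome(Enum):
    CONSERVED_PAIR = 1        # ortholog with an orthologous partner
    DIFFERENT_PARTNER = 2     # ortholog with a non-orthologous partner
    MISSING_PARTNER = 3       # ortholog without a partner within 1,000 bp
    NON_ORTHOLOGOUS_GENE = 4  # a different gene sits at the mapped location
    NO_ORTHOLOGY = 5          # nothing recorded in mouse


def classify(
    gene: str,
    partner: str,
    orthologs: Dict[str, str],        # human gene -> mouse ortholog (assumed input)
    region_hit: Dict[str, str],       # human gene -> mouse gene at mapped locus
    mouse_partners: Dict[str, Optional[str]],  # mouse gene -> partner <= 1,000 bp
) -> Outcome:
    hit = region_hit.get(gene)
    if hit is None:
        return Outcome.NO_ORTHOLOGY
    if orthologs.get(gene) != hit:
        return Outcome.NON_ORTHOLOGOUS_GENE
    mouse_partner = mouse_partners.get(hit)
    if mouse_partner is None:
        return Outcome.MISSING_PARTNER
    if mouse_partner == orthologs.get(partner):
        return Outcome.CONSERVED_PAIR
    return Outcome.DIFFERENT_PARTNER


# Toy tables with placeholder gene names (H* human, M* mouse).
orthologs = {"H1": "M1", "H2": "M2", "H3": "M3", "H4": "M4", "H5": "M5"}
region_hit = {"H1": "M1", "H3": "M3", "H4": "M4", "H5": "M9"}
mouse_partners = {"M1": "M2", "M3": "M7", "M4": None}

results = {
    "H1": classify("H1", "H2", orthologs, region_hit, mouse_partners),
    "H3": classify("H3", "H2", orthologs, region_hit, mouse_partners),
    "H4": classify("H4", "H2", orthologs, region_hit, mouse_partners),
    "H5": classify("H5", "H2", orthologs, region_hit, mouse_partners),
    "H6": classify("H6", "H2", orthologs, region_hit, mouse_partners),
}
for name, outcome in results.items():
    print(name, outcome.name)
```

The checks run from the weakest evidence to the strongest, so a gene reaches the conserved-pair outcome only after every earlier test has passed.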
Figure 3B shows the results of mapping bidirectional promoters from the human EST dataset to the mouse. In this case, fewer orthologous promoters were identified (12.4%); nevertheless, the same trends were observed as before. For example, allowing a distance larger than the 1,000 bp intergenic space between the transcription start sites in the mouse genes validated a larger number of orthologous gene positions (15.6%). A large number of examples had evidence for only one orthologous transcript (23%). The combination of the two datasets (Known Genes and spliced ESTs) increased the number of orthologous promoters modestly.
Orthology mapping of bidirectional promoters from mouse to human
Distribution of orthologous bidirectional promoters on chromosomes
On a per-chromosome basis, genes regulated by bidirectional promoters were not evenly distributed in either the human or mouse genomes (Figure 5, Additional file 2). However, their appearance was consistent with the allocation of genes per chromosome. For example, in the human genome, chromosome 13 has the lowest gene density (6.5 genes per Mb) among sequenced human autosomes, as well as one of the lowest numbers of bidirectional promoters. In contrast, chromosome 19 is the most gene-rich of all human chromosomes and has a high number of bidirectional promoters. Chromosome 16 had the highest ranking, containing over 52% of bidirectional promoters in human that were confirmed as orthologous in mouse. Those promoters currently showing no orthologous evidence represent either species-specific differences between human and mouse gene sets (true negatives) or missing annotations from the Known Genes dataset in mouse (false negatives). We observed a higher overall confirmation of orthologous bidirectional promoters when mapped from mouse to human (Additional file 3), suggesting that the annotations may be more complete in human, and the mapping from human to mouse failed more often due to missing gene annotations in mouse. We will continue to update the datasets as gene annotations continue to be refined.
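A per-chromosome ranking of this kind amounts to a simple tally. The Python sketch below uses invented records, not the study's data, and assumes each promoter is represented as a chromosome name paired with a flag indicating whether orthology was confirmed in the second species.

```python
from collections import Counter
from typing import Dict, List, Tuple


def confirmed_fraction_by_chrom(
    promoters: List[Tuple[str, bool]]
) -> Dict[str, float]:
    """Fraction of bidirectional promoters confirmed as orthologous,
    computed separately for each chromosome."""
    total: Counter = Counter()
    confirmed: Counter = Counter()
    for chrom, is_confirmed in promoters:
        total[chrom] += 1
        if is_confirmed:
            confirmed[chrom] += 1
    return {chrom: confirmed[chrom] / total[chrom] for chrom in total}


# Invented toy records: (chromosome, confirmed in second species?).
toy_promoters = [
    ("chr16", True), ("chr16", True), ("chr16", False),
    ("chr13", False), ("chr13", True), ("chr13", False), ("chr13", False),
]
fractions = confirmed_fraction_by_chrom(toy_promoters)
ranking = sorted(fractions, key=fractions.get, reverse=True)
print(ranking)
```

Reporting a fraction rather than a raw count keeps gene-poor and gene-rich chromosomes comparable, which matters given the uneven gene density described above.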
Orthologous promoters exhibit functional correlations
We have utilized the unique properties of bidirectional promoters to map orthologous regulatory regions. These promoters are flanked on each side by a spliced transcript. Therefore the presence of the orthologous genes in the same arrangement in another species identifies the intergenic promoter region as the orthologous promoter region. We have used this approach to map promoters from human to mouse without the aid of regulatory region sequence conservation to identify the orthologous promoter elements. Nevertheless sequence alignments were very important in defining the regions of orthology and conserved synteny. We show that orthologous regulatory regions can be identified using annotations of UCSC Known Genes or spliced ESTs. Furthermore, the combination of these datasets reveals additional promoter regions. By validating the predictions in the second species, we confirmed that bidirectional promoters are present in orthologous positions in mammalian genomes. Furthermore, we postulate that regions containing one of the genes, but not both, are likely to be missing the annotations for the partner gene. Thus we anticipate that as annotations grow more populated and refined, the data shown in our heat maps will confirm orthology at even more promoters.
Bidirectional promoters are enriched in mammalian genomes. Our approach of investigating the orthology of bidirectional promoters reveals thousands of examples of this type of regulatory structure maintained through evolutionary selection. By combining spliced ESTs and Known Genes, we identified a larger and more comprehensive set of bidirectional promoters. We subsequently found that many of these spliced ESTs represent non-coding RNAs (ncRNAs). This is consistent with recent reports that the majority of long non-coding RNAs (lncRNAs) are flanked by bidirectional promoters [10–12]. Thus, understanding regulatory mechanisms of bidirectional promoters can be useful in investigating ncRNAs whose functions remain largely unknown. The different types of bidirectional promoters we record based on annotation and orthology allow us to address the diversity of biological functions of these promoters.
Bidirectional gene pairs in human and mouse protein-coding genes
We downloaded protein-coding gene and spliced EST annotations from the UCSC Genome Browser. Assemblies hg38 and mm10 were used for the human and mouse genomes, respectively. We used 1,000 bp as the intergenic distance cut-off in defining a bidirectional promoter between the two genes of an adjacent pair. The major steps in identifying bidirectional promoters from spESTs include extracting all bidirectional promoters genome-wide, collapsing overlapping candidate promoter regions, filtering out false positives from the dataset and assigning confidence levels to these promoters.
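The collapsing step can be illustrated with a standard interval merge. This Python sketch is a generic technique offered as one plausible reading of "collapsing overlapping candidate promoter regions"; the text does not specify the exact rule used, and the treatment of touching intervals as mergeable is an assumption.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end)


def collapse(regions: List[Region]) -> List[Region]:
    """Merge candidate regions that overlap or touch on the same chromosome."""
    merged: List[Region] = []
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous region instead of starting a new one.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Toy candidate promoter regions supported by different transcript pairs.
candidates = [
    ("chr1", 400, 900),
    ("chr1", 100, 500),
    ("chr2", 100, 500),
    ("chr1", 2000, 2300),
]
collapsed = collapse(candidates)
print(collapsed)
```

Sorting first means each region only needs to be compared with the most recently merged one, so many redundant spEST-supported candidates reduce to a single locus in one pass.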
Orthology mapping of bidirectional promoters
A multi-stage approach to mapping orthology at bidirectional promoters was developed. Orthology assignments are strongest in coding regions. Therefore we began by mapping single human genes regulated by bidirectional promoters onto the mouse genome. Orthology assignments were determined using the "chains and nets" data from the UCSC Human Genome Browser MySQL tables. We used orthologous regions present in only level 1 chains and excluded any other levels, which contained both paralogous (duplicated during evolution) and orthologous sequences. Level 1 alignments also contained extremely long stretches of genes in conserved synteny (i.e., same gene identity and location) between species. Given a human gene, our approach examined whether it fell within an orthologous region defined by level 1 alignment data without knowledge of the exact position within an alignment or relative to a gap. In a subsequent step, we intersected the positions of gaps and exons of each gene to ensure that the exons fell into alignable positions across species.
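The final step, intersecting gaps with exons, can be sketched as an interval overlap test. The Python example below assumes half-open coordinates and treats a gene as alignable only if none of its exons overlaps an alignment gap; that all-exons requirement is an assumption, since the text does not state how partial overlaps were handled. Coordinates are invented.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def exons_in_gaps(exons: List[Interval], gaps: List[Interval]) -> List[Interval]:
    """Exons that overlap at least one alignment gap."""
    return [exon for exon in exons if any(overlaps(exon, gap) for gap in gaps)]


def gene_is_alignable(exons: List[Interval], gaps: List[Interval]) -> bool:
    """Assumed rule: every exon must lie outside all gaps."""
    return not exons_in_gaps(exons, gaps)


# Toy gene with three exons and two alignment gaps.
toy_exons = [(100, 200), (500, 650), (900, 1000)]
toy_gaps = [(250, 450), (640, 700)]
problem_exons = exons_in_gaps(toy_exons, toy_gaps)
print(problem_exons, gene_is_alignable(toy_exons, toy_gaps))
```

Here the first gap falls entirely within an intron and is harmless, while the second gap clips the end of the middle exon, so the toy gene fails the check.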
This work and publication were funded by the intramural research program of the National Human Genome Research Institute, National Institutes of Health (NIH). In addition, MQY was also supported by NIH 5P20GM10342913 and ASTA award # 15-B-23.
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 17, 2014: Selected articles from the 2014 International Conference on Bioinformatics and Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S17.
1. Yang M, Elnitski L: A computational study of bidirectional promoters in the human genome. Springer Lecture Series: Lecture Notes in Bioinformatics. 2007.
2. Yang M, Elnitski L: Orthology of Bidirectional Promoters Enables Use of a Multiple Class Predictor for Discriminating Functional Elements in the Human Genome. Proceedings of BIOCOMP. 2007.
3. Yang MQ, Koehly LM, Elnitski LL: Comprehensive annotation of bidirectional promoters identifies co-regulation among breast and ovarian cancer genes. PLoS computational biology. 2007, 3 (4): e72. 10.1371/journal.pcbi.0030072.
4. Kawaji H, Kasukawa T, Fukuda S, Katayama S, Kai C, Kawai J, Carninci P, Hayashizaki Y: CAGE Basic/Analysis Databases: the CAGE resource for comprehensive promoter analysis. Nucleic acids research. 2006, 34 (Database): D632-636.
5. Forrest AR, Kawaji H, Rehli M, Baillie JK, de Hoon MJ, Lassmann T, Itoh M, Summers KM, Suzuki H, Daub CO: A promoter-level mammalian expression atlas. Nature. 2014, 507 (7493): 462-470. 10.1038/nature13182.
6. Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, Chen Y, Zhao X, Schmidl C, Suzuki T: An atlas of active enhancers across human cell types and tissues. Nature. 2014, 507 (7493): 455-461. 10.1038/nature12787.
7. Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P: Initial sequencing and comparative analysis of the mouse genome. Nature. 2002, 420 (6915): 520-562. 10.1038/nature01262.
8. Semple CA: Deep genomics in shallow times: the finished sequence of human chromosomes 13 and 19. European journal of human genetics : EJHG. 2004, 12 (11): 875-876. 10.1038/sj.ejhg.5201254.
9. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G: A gene atlas of the mouse and human protein-encoding transcriptomes. Proceedings of the National Academy of Sciences of the United States of America. 2004, 101 (16): 6062-6067. 10.1073/pnas.0400782101.
10. Zhang A, Xu M, Mo YY: Role of the lncRNA-p53 regulatory network in cancer. Journal of molecular cell biology. 2014, 6 (3): 181-191. 10.1093/jmcb/mju013.
11. Gupta RA, Shah N, Wang KC, Kim J, Horlings HM, Wong DJ, Tsai MC, Hung T, Argani P, Rinn JL: Long non-coding RNA HOTAIR reprograms chromatin state to promote cancer metastasis. Nature. 2010, 464 (7291): 1071-1076. 10.1038/nature08975.
12. Prensner JR, Chinnaiyan AM: The emergence of lncRNAs in cancer biology. Cancer discovery. 2011, 1 (5): 391-407. 10.1158/2159-8290.CD-11-0209.
13. Hsu F, Kent WJ, Clawson H, Kuhn RM, Diekhans M, Haussler D: The UCSC Known Genes. Bioinformatics (Oxford, England). 2006, 22 (9): 1036-1046. 10.1093/bioinformatics/btl048.
14. Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, Diekhans M, Furey TS, Harte RA, Hsu F: The UCSC Genome Browser Database: update 2006. Nucleic acids research. 2006, 34 (Database): D590-598.
15. Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D: Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proceedings of the National Academy of Sciences of the United States of America. 2003, 100 (20): 11484-11489. 10.1073/pnas.1932072100.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.